[inria-00269967, v1] gene selection in cancer classification using gpso/svm and ga/svm hybrid algorithms
Author manuscript, published in "Congress on Evolutionary Computation, Singapor : Singapour (2007)"
Gene Selection in Cancer Classification using PSO/SVM and GA/SVM Hybrid Algorithms
Enrique Alba, Jos´e Garc´ıa-Nieto, Laetitia Jourdan and El-Ghazali Talbi
Abstract— In this work we compare the use of a Particle
a filter method, is in definition, independent of the learning
Swarm Optimization (PSO) and a Genetic Algorithm (GA)
algorithm used after it. The second one, the wrapper method,
(both augmented with Support Vector Machines SVM) for
which carries out the feature subset selection and classifi-
the classification of high dimensional Microarray Data. Both
cation in the same process, engages a learning algorithm
algorithms are used for finding small samples of informative genes amongst thousands of them. A SVM classifier with 10-
to measure the classification accuracy. From a conceptual
fold cross-validation is applied in order to validate and evaluate
point of view, wrapper approaches are clearly advantageous,
the provided solutions. A first contribution is to prove that
since the features are selected by optimizing the discriminate
P SOSV M is able to find interesting genes and to provide clas-
power of the finally used induction algorithm. sification competitive performance. Specifically, a new version of PSO, called Geometric PSO, is empirically evaluated for
Feature selection for gene expression analysis in cancer
the first time in this work using a binary representation in
prediction often uses wrapper classification methods [7] to
Hamming space. In this sense, a comparison of this approach
discriminate a type of tumor, to reduce the number of genes
with a new GASV M and also with other existing methods of
to investigate in case of a new patient, and also to assist
literature is provided. A second important contribution consists
in drug discovery and early diagnosis. Several classification
in the actual discovery of new and challenging results on six public datasets identifying significant in the development of a
algorithms could be used for wrapper methods, such as K-
variety of cancers (leukemia, breast, colon, ovarian, prostate,
Nearest Neighbor (K-NN) [8] or Support Vector Machines
and lung).
(SVM) [9]. By creating clusters a big reduction of the numberof considered genes and an improvement of the classification
Microarray technology (DNA microarray) [1] allows to
The definition of the feature selection problem is this:
simultaneously analyze thousands of genes and thus can give
given a set of features F = {f1, ., fi, ., fn}, find a subset
important insights about cell’s function, since changes in
F ⊆ F that maximizes a scoring function Θ : Γ → G such
the physiology of an organism are generally associated with
changes in gene expression patterns. Several gene expres-
F = argmaxG⊂Γ{Θ(G)},
sion profiles obtained from tumors such as Leukemia [2],Colon [3] and Breast [4] have been studied and compared to
where Γ is the space of all possible feature subsets of
expression profiles of normal tissues. However, expression
F and G a subset of Γ. The optimal feature selection
data are highly redundant and noisy, and most genes are
problem has been shown to be NP-hard [10]. Therefore,
believed to be uninformative with respect to studied classes,
only heuristics approaches are able to deal with large size
as only a fraction of genes may present distinct profiles for
problems. Recently, such advanced structured methods have
different classes of samples. So, tools to deal with these
been used to explore the huge space of feature subsets, like
issues are critically important. These tools should learn to
for example metaheuristics as Evolutionary Algorithms and,
robustly identify a subset of informative genes embedded
specifically, Genetic Algorithms (GAs) [11], [5], [12].
out of a large dataset which is contaminated with high-
In this work, we are interested in gene selection and
classification of DNA Microarray data in order to distin-
In this context, feature selection is often considered as
guish tumor samples from normal ones. For this purpose,
a necessary preprocess step to analyze these data, as this
we propose two hybrid models that use metaheuristics and
method can reduce the dimensionality of the datasets and
classification techniques. The first one consists of a Particle
often conducts to better analyses [6].
Swarm Optimization (PSO) [13] combined with a SVM
Two models of feature selection exist depending on
approach. PSO is a population based metaheuristic inspired
whether the selection is coupled with a learning scheme or
by the social behavior of bird flocking or fish schooling.
not. The first one, filter model, which carries out the feature
Specifically, a recent version called Geometric PSO [14]
subset selection and the classification in two separate phases,
(explained in Section II) has been used in this work. The
uses a measure that is simple and fast to compute. Hence,
second model is based on the popular GA using a spe-
E. Alba and J. Garc´ıa-Nieto are with the Department of Lengua-
cialized Size-Oriented Common Feature Crossover Operator
jes y Ciencias de la Computaci´on, Universidad de M´alaga, (email:
(SSOCF) [15], which keeps useful informative blocks and
{eat,jnieto}@lcc.uma.es, url http://neo.lcc.uma.es)
produces offsprings which have the same distribution than
L. Jourdan and E-G. Talbi are with the LIFL/INRIA Futurs-Universit´e
de Lille 1, Bˆat M3-Cit´e Scientifique, (email: {jourdan,talbi}@lifl.fr, url
the parents. This model will be also combined with SVM in
Both proposed approaches are experimentally assessed on
six well-known cancer datasets (Leukemia, Colon, Breast,
Support Vector Machines, a technique derived from statis-
Ovarian [16], Prostate [17] and Lung [18]), discovering new
tical learning theory, is used to classify points by assigning
and challenging results and identifying specific genes that
them to one of two disjoint half spaces [9]. So, SVM
our work suggests as significants. Performances of proposed
performs mainly a (binary) 2-class classification. For linearly
PSO and GA algorithms solving the gene extraction problem
separable data, SVM obtains the hyperplane which maxi-
(using SVM) are compared in this paper. Specifically, we fo-
mizes the margin (distance) between the training samples
cused in the capacity of the P SOSV M combination in order
and the class boundary. For non linearly separable cases,
to provide considerable performance in this matter. In this
samples are mapped to a high dimensional space where
sense, comparisons with several state of art methods show
such a separating hyperplane can be found. The assignment
competitive results according to the conventional criteria.
is carried out by means of a mechanism called the kernel
The outline of this work as follows. We review the PSO
and the SVM techniques in order to introduce our P SOSV M
SVM is widely used in the domain of cancer studies,
hybrid model in Section II. In Section III, the six microarray
protein identification and specially in Microarray data [6],
datasets used in this study are described. Experimental results
[12], [19]. Unfortunately, in many bioinformatics problems
are presented in Section IV, including biological descriptions
the number of features is significantly larger than the number
of several obtained genes. Finally, we summarize our work
of samples. For this reason, tools for decreasing the number
and present some conclusions and possible future work in
of features in order to improve the classification or to help
to identify interesting features (genes) in noisy environmentsare necessary. In addition, SVM can treat data with a large
number of genes, but it has been shown that its perfor-
mance is increased by reducing the number of genes [20].
In this section, we describe the hybrid P SOSV M approach
The hybrid PSO and hybrid GA approaches next proposed
for gene selection and classification of Microarray data. The
PSO algorithm is designed for obtaining gene subsets assolutions in order to reduce the high number of genes to
be later classified. The SVM classifier is used whenever the
In order to offer a basic idea of the operation of our
fitness evaluation of a tentative gene subset is required. P SOSV M approach, in Figure 1, we can observe a sim-ple scheme of how features are extracted from the initial
microarray dataset and how the resulted subset is evaluated.
Particle Swarm Optimization was first proposed by
In a first phase, the metaheuristic algorithm involved, PSO
Kennedy and Eberhart in 1995 [13]. PSO is a population
in this case, provides a binary encoded particle1 where each
based evolutionary algorithm inspired in the social behavior
bit2 represents a gene. If a bit is 1, it means that this gene
of bird flocking or fish schooling. In the description of PSO,
is kept in the subset and 0 indicates that the gene is not
the swarm is made up of a certain number of particles (sim-
included in the subset. Therefore, the particle length is equal
ilar to population of individuals in EAs). At each iteration,
to the number of genes in the initial microarray dataset.
all the particles move in the problem space to find the global
The original PSO was initially developed for continuous
optima. Each particle has a current position vector and a
optimization problems. However, lots of practical enginee-
velocity vector for directing its movement.
ring problems are formulated as combinatorial optimizationproblems and specifically as binary decisions. Several binary
= ω · vki + ϕ1 · rnd1 · (pBesti − xki) + ϕ2 · rnd2 · (gi − xki) (2)
versions of PSO can be found in present literature [21], [22]. Nevertheless, these versions consist on ad hoc adaptations
from the original PSO and therefore their performances areusually improvable.
Equations 2 and 3 describe the velocity and position update
With the aim of facing the gene selection problem, an
of a given particle i at a certain iteration k. Equation 2
innovative version of PSO, based on the geometric frame-
calculates a new velocity vi for each particle (potential
work presented in [14], has been developed in this work.
solution) based on its previous velocity, the particle’s location
This version, called Geometric Particle Swarm Optimiza-
at which the best fitness so far has been found pBesti,
tion (GPSO), enables to us to generalize PSO to virtually
and the population global (or local neighborhood, in the
any solution representation in a natural and straightforward
neighborhood version of the algorithm) location at which
way. This property was demonstrated for the cases of Eu-
the best fitness so far has been achieved gi. Individual and
clidean, Manhattan and Hamming spaces in the referenced
social weight are represented by means of ϕ1 and ϕ2 factors
work. Even a recently appeared work [23], uses the GPSO
respectively. Finally, rnd1 and rnd2 are random numbers
for solving the Sudoku Puzzle by means of permutations
in range {0, 1}, and ω represents the inertia weight factor. Equation 3 updates each particle’s position xi in solution
1chromosome in GA and solution (S) in Figure 1
Subset Evaluation Solution (S). Provided by Metaheuristic (PSO, GA) SVM – Classification 10-Fold Cross Validation Microarray Initial Dataset 1 0 1 1 . 0 0 1 Fitness(S)=Accuracy & Subset Size
A simple scheme of how features (genes) are selected out from the original microarray dataset using a particle with binary encoding. In a second
phase, the resulted subset is evaluated by means of a SVM classifier and 10-fold cross validation to obtain the fitness value (accuracy) of such particle.
representation. Since the gene selection problem has been
Algorithm 1 Pseudocode of the GPSO for Hamming space.
represented by binary way, specific operators for Hamming
space were used in the PSO described here.
2: while not stop condition do 3: for each particle xi of the swarm S do D. Geometric Particle Swarm Optimizationif f itness(xi) is better than fitness(hi) then
In this version, the location of each particle i is represented
as vector xi = xi1, xi2, ., xiN taking each bit xij (with
if f itness(hi) is better than fitness(gi) then j in {1, N }) binary values 0 or 1. The key issue of the
GPSO is the concept of particle movement. In this approach,
instead of the notion of velocity added to the position, a
for each particle xi of the swarm S do three-parent mask-based crossover (3PMBCX) operator is
xi ← 3P MBCX((xi, w1), (gi, w2), (hi, w3))
applied to each particle in order to “move” it. According to
the definition of 3PMBCX [14], given three parents a, b and
16: end while c in {0, 1}n, generate randomly a crossover mask of length n
17: Output: best solution found
with symbols from the alphabet {a, b, c}. Build the offspringfilling each element with the bit from the parent appearingin the crossover mask at the position.
a second phase, after the evaluation of particles (line 4),
The pseudocode of the GPSO algorithm for Hamming
historical and social position are updated (lines 5 to 10).
spaces is illustrated in Algorithm 1. For a given particle i,
Finally, particles are “moved” by means of the 3PMBCX
three parents take part in the 3PMBCX operator (line 13): the
operator (line 13). In addition, with a probability of 10%,
a simple bit-mutation operator (line 14) is applied in order
i, the social best position gi and the histori-
to avoid the early convergence. This process is repeated
i (of this particle). The weight values
w1, w2 and w3 indicate for each element in the crossover
until reach the stop condition fixed to a certain number of
mask the probability of having values from the parents xgi or hi respectively. These weight values associated to each
parent represent the inertia value of the current position (w1),the social influence of the global/local best position (w2) and
Since the position of a particle xi represents a gene subset,
the individual influence of the historical best position found
the evaluation of each particle is carried out by means of the
(w3). A constriction of the geometric crossover forces w1,
SVM classifier to assess the quality of the represented gene
w2 and w3 to be non-negative and add up to one.
subset. The fitness of a particle xi is calculated applying a 10-fold Cross Validation (10FCV) method to calculate the rate of
In summary, the GPSO developed in this study oper-
correct classification (accuracy) of a SVM trained with this
ates as follows: In a first phase of the pseudocode, the
gene subset. In 10FCV, the data set is divided into 10 subsets.
initialization of particles are carried out by means of the
Each time, one of the 10 subsets is used as the test set and
SwarmInitialization() function (Line 1). This special
the other 9 subsets are put together to form the training set.
initialization method (used also in our GA approach) was
Then the average error across all 10 trials is computed. The
adapted to gene selection as follows. The swarm (population)
complete fitness function is described in Equation 4.
was divided into four subsets of particles (chromosomes)initialized in different ways depending on the number of
f itness(x) = α · (100/accuracy) + β · #f eatures, (4)
features in each particle. That is, 10% of particles wereinitialized with N (prefixed value) selected genes (1s) located
where α and β are weight values set to 0.75 and 0.25
randomly. Another 20% of particles were initialized with 2N
respectively in order to control that the accuracy value
genes, 30% with 3N genes and finally, the rest of particles
takes priority over the subset size, since high accuracies are
(40%) were initialized randomly and 50% of the genes were
preferred when guiding the search process. The objective
turned on. In these experiments N will be equal to 4. In
here consists of maximizing the accuracy and minimizing
the number of genes (#f eatures). For convenience (only
All experiments were carried out using a PC with Linux
minimization of fitness) the first factor is presented as
O.S (Suse 9.0 with kernel 2.4.19) and a Pentium IV 2.8GHz
processor, with 512MB of RAM. P SOSV M and GASV Malgorithms on six cancer related microarray datasets were
independent executed 10 times over each dataset, in order to
Instances used in this study consists of six well-
have statistically meaningful conclusions as both algorithms
known datasets issued of microarray experiments, ALL-
AML Leuke-mia dataset, Breast cancer dataset, Colon tumordataset, Ovarian cancer dataset, Prostate cancer dataset, and
Lung cancer dataset. All of them were taken from the public
The parameters used in our PSO and GA algorithms
Kent Ridge Bio-medical Data Repository with url http://
are shown in Table I. These parameter were selected after
sdmc.lit.org.sg/GEDatasets/Datasets.html.
several test evaluations of each algorithm and dataset instance
◦ The ALL-AML Leukemia dataset consists of 72 mi-
until reach the best configuration in terms of the quality of
croarray experiments with 7129 gene expression le-
solutions and the computational effort.
vels. Two classes for distinguishing: Acute MyeloidLeukemia (AML) and Acute Lymphoblastic Leukemia
(ALL). The complete dataset contains 25 AML and 47
PSO AND GA PARAMETERS FOR GENE SUBSET SELECTION AND
◦ The Breast cancer dataset consists of 97 experiments
with 24481 gene expression levels. Patients studied
Parameter Parameter
show two classes of diagnosis called relapse with 46
patients and non-relapse with 51 ones. ◦ The Colon tumor dataset consists of 62 microarray
experiments collected from colon-cancer patients with2000 gene expression levels. Among them, 40 tumorbiopsies are from tumors and 22 (normal) biopsies are
from healthy parts of the colons of the same patients. ◦ The Lung cancer dataset involves 181 microarray ex-
Several observations can made based on the above expe-
periments with 12533 gene expression levels. Classifi-
riments, so we tackle the analysis of results focusing on the
cation occurs between Malignant Pleural Mesothelioma
performance and robustness of our algorithms, as well as
(MPM) and Adenocarcinoma (ADCA) of the lung. In
the quality of the obtained solutions providing a biological
tissue samples there are 31 MPM and 150 ADCA.
description of most significant ones. ◦ The Ovarian cancer dataset consists of 253 microarray
1) Performance Analysis: From the point of view of
experiments with 15154 gene expression levels. The
the performance, both algorithms obtain in a few iterations
goal of this experiment is to identify proteomic patterns
acceptable results in gene selection, providing reduced sub-
in serum that distinguish cancer from non-cancer scenar-
sets with high classification rates. However, the behavior
ios. The dataset includes 162 (of 253) ovarian cancers
is slightly different. Figure 2 shows a graphical evolution,
in terms of the average of the fitness value, of a typical
◦ The Prostate cancer dataset involves 136 microarray
execution of P SOSV M and GASV M . It is noticeable that in
experiments with 12600 gene expression levels. Two
few iterations (4 or 5) the average of fitness decrease quickly
classes must be differentiated: tumor with 77 (52 + 25)
and then stop in similar solutions. The large diversity of
samples and non-tumor with 59 (50+9) samples.
solutions provided in the initialization method (Section II-B)provokes fast of good solutions and the early convergence
of both methods. Although the GASV M generally obtains
For our P SOSV M approach, the PSO was imple-
lower average than P SOSV M , whose solutions have in turn
mented in C++ following the skeleton architecture of the
MALLBA [24] library. For the GASV M approach the GA
Results for all the datasets are shown in Table II. Columns
was implemented in C++ using the ParadisEO [25] Frame-
2 and 3 contain the average of the best solutions obtained
work. The GA implements a generational evolution strategy
in 10 independent executions of P SOSV M and GASV M
(offspring replacement with elitism) and uses the follow-
respectively. Six state of the art methods from literature are
ing operators: deterministic tournament selection, SSOCF
presented in columns 4 to 10 in order to show how our
crossover, and uniform mutation. The SVM classifier used
proposals actually push forward the research in this area.
in both approaches is based on the LIBSVM [26] library.
Cells in - haven’t values to our knowledge. Standard criteria
For the SVM confi-guration, the same parameters were used
are used to compare the results: the classification accuracy
in PSO and GA algorithms and the Kernel function was
in terms of the rate of correct classification (first value in
configured as Linear. The fitness function used in GASV M
every table cell) and the number of selected genes (the value
is the same one (described in Section II) as in P SOSV M .
COMPARISON OF RELEVANT WORKS ON CANCER CLASSIFICATION WITH PROPOSED MODELS GASV M AND P SOSV M . IN BOLD WE MARK THE
MOST ACCURATE RESULTS. CELLS WITHOUT KNOWN VALUE (TO US) ARE MARKED WITH - CHARACTER
Luekemia Prostate 2) Algorithm Robustness: One of the most important
criteria in evaluating any proposed algorithm is the quality
of the algorithm and its ability to generate similar (identical)
outcomes when executed several times. This factor is veryimportant for metaheuristics which is the case in this work.
To examine the robustness of the two proposed approaches,
AVG Fitness
in some instances, in all ten runs both algorithms manages
to find the same answer or similar ones (not identical). However, it is worthwhile mentioning that the total accuracy
and number of selected features in all the cases did notdeviate from each other by more than 5.5. Table III shows the
Number of Iterations
result of running the GASV M and P SOSV M algorithms interms of statistical results, reporting the Best solution found,
Fig. 2. Evolution of the average fitness (AVG Fitness) in a typical execution
Mean and Standard Deviation of ten independent runs.
of P SOSV M and GASV M approaches using the Leukemia dataset.
COMPARISON IN TERMS OF STATISTICAL RESULTS OF THE GASV M
In this comparison, we can observe that all solutions
AND P SOSV M APPROACHES. THE Best SOLUTION FOUND, Mean AND
SV M and GASV M algorithms present
a classification rate higher than 86%, and subsets with
four and less than four selected genes are common. We
Leukemia
outperform all the existing results (to our knowledge) but
one case [6] presents a smaller subset of 2 genes. We suspect
that the initialization method used in our work helps the
performance of the algorithms significantly, finding small
Prostate
subsets with a high classification accuracy. PSO with SVMalgorithm was also used in [30] (PSO/SVM from now on),
3) Brief Biological Analysis of Selected Genes: Finally, a
although showing several differences with respect to our
summary of the best subsets of genes found for each dataset
P SOSV M . In the first place, in PSO/SVM the binary values
is shown in Table IV. All subset of genes which reported
of each particle are chosen at each iteration by means of a
close to 100% test accuracy and the minimum number of
decision function based on threshold parameters, whereas in
genes. It is remarkable that apparently (to our knowledge)
P SOSV M , binary particles evolve following a completely
several discovered genes that has not been seen in any past
different mechanism, that is, 3PMBCX crossover and muta-
studies. In this sense, we can provide a brief biological
tion operators (Subsection II-D). In second place, during the
description of some of the most frequently obtained genes
evaluation phase, a leave-half-out cross-validation method
since they are currently used in the design of drugs and
was used in PSO/SVM whereas the P SOSV M validates the
cancers treatment. Some of which are listed below:
selected subsets by means of 10 k fold cross-validation. • Gene L12052 at is “CAMP phosphodiesterase mRNA,
If we compare our GPSO and GA metaheuristics com-
3’ end” which is used in drugs like Anagrelide or
bined with SVM similar results are found. In general, the GA
Milrinone. Specifically the Anagrelide is used for the
approach obtain better best solutions (Hits) although the best
treatment of essential thrombocytosis, and it was proved
classification was provided by GP SOSV M for the Colon
to be effective in treating patients with certain kinds of
tumor dataset (100% accuracy and 2 genes in Table IV).
leukemia such as chronic myeloid leukemia [31]. This
From the point of view of the accuracy average in all
gene belongs to a set of 3 genes (reported from leukemia
independent runs, the GPSO obtain a better performance
dataset in Table IV) with 100% accuracy selected by the
although the difference with regard to the GA (as shows
• Gene AB022847 is a “Solute carrier family 6 (neu-
SUBSETS OF GENES REPORTED WITH 100% TEST ACCURACY
Leukemia Prostate Independent runs Independent runs
Fig. 3. Accuracy obtained by P SOSV M and GASV M in each independent run. Legends specify the datasets with the number of features in parenthesis.
rotransmitter transporter, noradrenalin), member 2” lo-
belongs to a set of 4 genes (reported from lung dataset in
cated in plasma membrane. Current drugs like Radax-
Table IV) with 100% accuracy selected by the GASV M .
afine, Amphetamine or Venlafaxine are associated withthis gene. Specifically, Venlafaxine is a prescription
antidepressant first introduced by Wyeth in 1993. It
In this work, two hybrid techniques for gene selection
is specifically used in management of hot flashes in
and classification of high dimensional DNA Microarray data
survivors of breast cancer [32]. This gene belongs to a
were presented and compared. These techniques are based
set of 3 genes (reported from breast dataset in Table IV)
on different metaheuristic algorithms such as GPSO and GA
with 95.8763% accuracy selected by the P SOSV M .
used for feature selection using the SVM classifier to identify
• Gene 36245 at is “5-hydroxytryptamine (serotonin) re-
potentially good gene subsets. Specifically, the Geometric
ceptor 2B” located in human plasma membrane. There
PSO algorithm for Hamming space was used to solve a real
are several drugs where this gene is used like Risperi-
problem (gene selection in this case) for the first time (to
done, Blonanserin and Mirtazapine. Some studies con-
our knowledge). In addition, genes selected are validated by
sider Mirtazapine as the first-choice agent for anxiety
an accurate 10-fold cross validation method to improve the
and depression after lung transplantation. This gene
Both approaches (P SOSV M and GASV M ) were experi-
[12] E. B. Huerta, B. Duval, and J.-K. Hao, “A Hybrid GASVM Approach
mentally assessed on six well-known cancer datasets disco-
for Gene Selection and Classification of Microarray Data,” in LectureNotes in Computer Science of EvoWorkshops, F. Rothlauf, J. Branke,
vering new and challenging results, and identifying specific
S. Cagnoni, E. Costa, C. Cotta, R. Drechsler, E. Lutton, P. Machado,
genes that our work suggests as significant ones. In this
J. H. Moore, J. Romero, G. D. Smith, G. Squillero, and H. Takagi,
sense, comparisons with several state of art methods show
[13] J. Kennedy and R. Eberhart, “Particle Swarm Optimization,” in Proc.
competitive results according to standard evaluation. Results
of the IEEE International Conference on Neural Networks, vol. 4,
of 100% classification rate and few genes per subset (3 and
4) are obtained in most of our executions. The use of an
[14] A. Moraglio, C. D. Chio, and R. Poli, “Geometric Particle Swarm
Optimization,” in 10th European conference on Genetic Programming
adapted initialization method has shown a great influence on
(EuroGP 2007), ser. Lecture Notes in Computer Science, vol. 4445.
the performance of proposed algorithms, since it introduces
an early set of acceptable solutions in their evolution process.
[15] L. Jourdan, C. Dhaenens, and E.-G. Talbi, “A genetic algorithm for
feature selection in data-mining for genetics,” in Proceedings of the
Continuing the line of this work, we are interested in deve-
4th Metaheuristics International ConferencePorto (MIC’2001), Porto,
loping and testing several combinations of other metaheuris-
tics with classification methods in order to discover new and
[16] E. F. Petricoin, A. M. Ardekani, B. A. Hitt, P. J. Levine, V. A. Fusaro,
S. M. Steinberg, G. B. Mills, C. Simone, D. A. Fishman, E. C. Kohn,
better subsets of genes using specific Microarray datasets. In
and L. A. Liotta, “Use of proteomic patterns in serum to identify
this sense, the utilization of multiobjective approaches could
ovarian cancer,” The Lancet, vol. 359, pp. 572–577, 2002.
contribute notably in gene subset selection.
[17] D. Singh, P. Febbo, K. Ross, D. Jackson, J. Manola, C. Ladd,
P. Tamayo, A. Renshaw, A. D’Amico, and J. Richie, “Gene expression
correlates of clinical prostate cancer behavior,” Cancer Cell, vol. 1, pp.
[18] G. J. Gordon, R. V. Jensen, L.-L. Hsiao, S. R. Gullans, J. E.
Blumenstock, S. Ramaswamy, W. G. Richards, D. J. Sugarbaker, and
PERFOM 3+3 Mediterranean project. E. Alba and J.
R. Bueno, “Translation of microarray data into clinically relevant
cancer diagnostic tests using gene expression ratios in lung cancer
and mesothelioma,” Cancer Res, vol. 62, pp. 4963–4967, 2002.
[19] S. Mukherjee, Classifying Microarray data using Support Vector
contract TIN2005-08818-C04-01 (the OPLINK project,
Machines, Boston, MA, 2003, ch. 9, pp. 166–185.
[20] T. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and
D. Haussler, “Support vector machines classification and validation of
cancer tissue samples using microarray expression data,” Bioinformat-ics, vol. 16, no. 10, pp. 906–914, 2000.
[1] A. Pease, D. Solas, E. Sullivan, M. Cronin, C. P. Holmes, and S. Fodor,
[21] J. Kennedy and R. Eberhart, “A Discrete Binary Version of the
“Light-generated oligonucleotide arrays for rapid dna sequence anal-
Particle Swarm Algorithm,” in Proceedings of the IEEE International
ysis,” in Proc. Natl. Acad. Sci., vol. 96, USA, 1994, pp. 5022–5026. Conference on Systems, Man and Cybernetics, vol. 5, 1997, pp. 4104–
[2] R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P.
Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D.
Bloomfield, and E. S. Lander, “Molecular classification of cancer:
Derivations, and Mathematical Insights,” 2005. [Online]. Available:
Class discovery and class prediction by gene expression monitoring,”
Science, vol. 286, pp. 531–537, 1999.
[23] A. Moraglio and J. Togelius, “Geometric particle swarm optimization
[3] Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and
for the sudoku puzzle,” in Proceedings of the Genetic and Evolutionary
A. J. Levine, “Broad patterns of gene expression revealed by clustering
Computation Conference (GECCO 2007). London, UK: ACM Press,
analysis of tumor and normal colon tissues probed by oligonucleotide
arrays,” Proc. Natl. Acad. Sci, vol. 96, pp. 6745–6750, 1999.
[24] E. Alba and M. Group, “Mallba: A Library of Skeletons for Combina-
[4] L. J. van ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M.
torial Optimisation,” in Proceedings of the Euro-Par, B. Monien and
Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T.
R. Feldmann, Eds., vol. LNCS 2400, 2002, pp. 927–932.
Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley,
[25] S. Cahon, E.-G. Talbi, and N. Melab, “Paradiseo: A framework for
R. Bernards, and S. H. Friend, “Gene expression profiling predicts
parallel and distributed metaheuristics,” in Proc. of Internationa the
clinical outcome of breast cancer,” Nature, vol. 415, pp. 530–536,
Parallel and Distributed Processing Symposium, 2003, pp. 144–155.
[5] Juliusdottir, D. Corne, E. Keedwell, and A. Narayanan, “Two-phase
EA/K-NN for feature selection and classification in cancer microarray
http://www.csie.ntu.edu.tw/ cjlin/libsvm, 2002.
datasets,” in CIBCB, 2005, pp. 1–8.
[27] K. Deb and A. R. Reddy, “Classification of two-class cancer data
[6] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection
reliably using evolutionary algorithms,” KanGAL Report No. 2003001,
for cancer classification using support vector machines,” MachineLearning, vol. 46, no. 1-3, pp. 389–422, 2002. [Online]. Available:
[28] L. Yu and H. Liu, “Redundancy based feature selection for microarry
data,” in Proceedings of International Conference on Knowledge
[7] J. Kohavi and G. H. John, “The wrapper approach,” in FeatureDiscovery and Data Mining, Seattle, Washington, 2004, pp. 22–25. Selection for Knowledge Discovery and Data Mining, 1998, pp. 33–50.
[29] B. Liu, Q. Cui, T. Jiang, and S. Ma, “A combinational feature
[Online]. Available: citeseer.ist.psu.edu/article/kohavi97wrapper.html
selection and ensemble neural network method for classification of
[8] Fix and J. L. Hodges, “Discriminatory analysis. nonparametric dis-
gene expression data,” BMC Bioinformatics, vol. 5, pp. 136–148, 2004.
crimination: Consistency properties,” 4, US Air Force School of
[30] Q. Shen, W. M. Shi, W. Kong, and B. X. Ye, “A combination of
Aviation Medicine, Randolph Field, TX, Tech. Rep., 1951.
modified particle swarm optimization algorithm and support vector
[9] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learn-
machine for gene selection and classification,” Talanta, 2006. ing, vol. 20, no. 3, pp. 273–297, 1995.
[31] R. T. Silver, “Anagrelide is effective in treating patients with
[10] M. Narendra and K. Fukunaga, “A branch and bound algorithm for
hydroxyurea-resistant thrombocytosis in patients with chronic myeloid
feature subset selection,” IEEE Trans. on Computer, vol. 26, pp. 917–
leukemia,” Leukemia, vol. 19, no. 3, p. 3943, 2005. [Online]. Available:
[11] J. Yang and V. Honavar, Feature Extraction, Construction and Selec-
[32] C. L. Loprinzi, “Venlafaxine in management of hot flashes in survivors
tion: A Data Mining Perspective, 1998, ch. Feature Subset Selection
of breast cancer: a randomised controlled trial,” The Lancet, vol. 356,
Using a Genetic Algorithm, pp. 177–136.
A DIVISION OF THE SYNOPTICS GROUP Automated colony counting for molecular biologists How it can improve productivity when selecting recombinant clone Introduction Transformation efficiency is a measure of the amount transmitted and darkfield illumination, enabling coloniesof cells within the baceterial culture that are able to takeand plaques to be counted on a range of media types.