Genethic Algorithm and Tree Based Classification in Bioinformatics

Inese Poļaka; Arkādijs Borisovs

European Conference on Data Analysis 2013: Book of Abstracts 2013

The study proposes a decision-tree-based classification of gene expression and protein display data. The problem addressed is the accurate classification of records that hold gene or protein microarray data of cancer patients and healthy donors. The data available to discriminate between conditions overwhelm most classical classifiers because of their high dimensionality: the number of patients is small due to the high cost of the tests, while the number of genes or proteins tested simultaneously is very large (several thousand). The study proposes using the random subspace method with genetic algorithms to generate decision tree classifiers for bioinformatics microarray data. The hypothesis that an efficient decision tree classifier of limited depth can be formed from a randomly chosen subspace of attributes rests on two facts: only a small fraction of the genes or proteins used in the microarray tests are actual biomarkers of the medical condition in question (cancer), and the classifier has to be small enough not to be too specific to the small number of records (Occam’s razor). The full data set is reduced by forming a ‘gene pool’, a random attribute subset that is used when generating decision trees with a genetic algorithm. The genes of a chromosome are the attributes used in a decision tree classifier, and the ‘gene pool’ is used to determine the attribute at each node of the tree (defining a ‘gene pool’ for each node). The search for the classification model in the finite random subspace is carried out using the classical genetic algorithm and its operators: selection (elitism), crossover (after tournament selection of parent chromosomes) and mutation. The search is performed by increasing the probability of the best gene in the gene pool for the best classifier in each iteration.
The experimental evaluation of the proposed method shows that the classifiers are as accurate as, or even more accurate than, the classical algorithm models, without the need for costly computations that consider all attributes and without overgrowing the classifiers.
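
The search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions introduced only for illustration: the tree is a complete binary tree of fixed depth stored in heap order, each node splits at the median of its attribute, leaves predict the majority class of the training records that reach them, fitness is training accuracy, and the "increase the probability of the best gene" step is read as multiplying the sampling weight of every attribute used by the current best tree. All function and parameter names (`evolve`, `boost`, `p_mut` and so on) are hypothetical.

```python
import random
from collections import Counter

def make_gene_pool(n_attributes, pool_size, rng):
    """Random subspace: sample attribute indices without replacement."""
    return rng.sample(range(n_attributes), pool_size)

def leaf_index(chrom, thresholds, row):
    """Route one record through a complete binary tree stored in heap order."""
    node = 0
    while node < len(chrom):
        attr = chrom[node]
        node = 2 * node + (1 if row[attr] <= thresholds[attr] else 2)
    return node

def accuracy(chrom, thresholds, X, y):
    """Label each leaf by majority vote, then score on the same records."""
    votes = {}
    for row, label in zip(X, y):
        votes.setdefault(leaf_index(chrom, thresholds, row), Counter())[label] += 1
    leaves = {leaf: c.most_common(1)[0][0] for leaf, c in votes.items()}
    hits = sum(leaves[leaf_index(chrom, thresholds, row)] == label
               for row, label in zip(X, y))
    return hits / len(y)

def evolve(X, y, pool_size=5, depth=2, pop_size=20, generations=15,
           elite=2, tournament=3, p_mut=0.1, boost=1.5, seed=0):
    rng = random.Random(seed)
    pool = make_gene_pool(len(X[0]), pool_size, rng)
    # Assumption: one fixed split point per attribute (its median value).
    thresholds = {a: sorted(r[a] for r in X)[len(X) // 2] for a in pool}
    weights = {a: 1.0 for a in pool}      # sampling weight of each gene
    n_nodes = 2 ** depth - 1              # internal nodes = chromosome length

    def draw():
        return rng.choices(pool, weights=[weights[a] for a in pool])[0]

    def fitness(chrom):
        return accuracy(chrom, thresholds, X, y)

    population = [[draw() for _ in range(n_nodes)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        # Assumed reading of the probability update: favour the genes
        # that appear in the best classifier of this iteration.
        for a in set(ranked[0]):
            weights[a] *= boost
        next_gen = [c[:] for c in ranked[:elite]]            # elitism
        while len(next_gen) < pop_size:
            p1 = max(rng.sample(ranked, tournament), key=fitness)  # tournament
            p2 = max(rng.sample(ranked, tournament), key=fitness)
            cut = rng.randrange(1, n_nodes) if n_nodes > 1 else 0
            child = p1[:cut] + p2[cut:]                      # one-point crossover
            child = [draw() if rng.random() < p_mut else g for g in child]  # mutation
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    return best, fitness(best), pool
```

Calling `evolve(X, y)` with `X` as a list of numeric rows and `y` as a list of class labels returns the best chromosome, its training accuracy and the sampled gene pool. Because the sketch scores trees on the training records themselves, a real evaluation would need held-out data or cross-validation, which matters all the more with the small sample sizes the abstract mentions.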

Poļaka, I., Borisovs, A. Genethic Algorithm and Tree Based Classification in Bioinformatics. In: European Conference on Data Analysis 2013: Book of Abstracts, Luxembourg, Luxembourg, 10-12 July 2013. Luxembourg: 2013, p. 107. ISBN 9782879711058.

Publication language
English (en)

Publication type
Paper in a conference book of abstracts
Funding attributed to core activity
Not known
Research field
2. Engineering and technology
Research subfield
2.2. Electrical engineering, electronics, information and communication technologies
ID: 16697