Journal of Information Management (資訊管理學報)

胡雅涵;黃正魁;楊承翰;
Pages: 305-339
Date: 2014/07
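Abstract: The rapid growth of digital information has raised the cost of searching for information, and how to classify and manage documents effectively has become an important research issue. Research on text categorization is therefore increasingly important, yet the field suffers from excessively high-dimensional document features. We therefore select feature terms from documents using a genetic algorithm (GA): by designing the GA chromosome over the document feature vector and tuning the GA parameters, the classifier selects feature terms from the training data, and a text categorization model is then built. The proposed GA-based feature selection (GAFS) method lets each individual classifier optimize itself through self-learning, which improves each classifier's performance and yields a text categorization model with optimized classification results. In the experiments, we use the WebKB web document dataset to evaluate the models built with GAFS and compare them with the conventional approach of training on the full feature set (denoted TOTAL). Six classifiers are considered: the naïve Bayesian classifier, decision tree, classification and regression tree, random forest, support vector machine, and k-nearest neighbor. The results show that GAFS effectively improves the performance of every classification model, confirming that the GA-based GAFS automated text categorization model clearly outperforms TOTAL. Moreover, as the feature dimensionality grows, GAFS continues to improve classification performance and maintains stable classification accuracy.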
Keywords: document categorization; genetic algorithm; feature selection; classifier

A Genetic Algorithm Based Approach for Text Categorization


Abstract: Purpose: Digital data has accumulated rapidly, significantly increasing the cost of searching for information. How to manage documents effectively (i.e., text categorization, TC) has become an important research issue. However, in TC, a huge number of index terms are selected to represent document vectors, resulting in poor prediction outcomes. This study proposes a genetic algorithm based feature selection (GAFS) method to optimize the selection of index terms. Design/methodology/approach: Before training classifiers, GAFS selects a reduced set of index terms that optimizes the prediction accuracy of the classifiers. In the experimental study, the WebKB dataset was used to evaluate the performance of GAFS. Six well-known classification techniques were considered: naïve Bayesian classifier (NB), decision tree (DT), classification and regression tree (CART), random forest (RF), support vector machine (SVM), and k-nearest neighbor (kNN). The baseline model, denoted TOTAL, uses the complete set of index terms in all experiments. Findings: The results show that the proposed GAFS method outperforms the TOTAL method. The performance of the kNN and RF classifiers deteriorates as the number of features increases. Across different numbers of features, the SVM, NB, and DT classifiers perform stably, whereas the CART classifier is relatively unstable. Research limitations/implications: This study considers only the WebKB dataset. Future research is recommended to include other well-known datasets in the TC domain. Other feature selection methods can also be considered in the experimental evaluation. Practical implications: Two practical implications are provided. First, this study reveals that different parameter settings in the genetic algorithm (GA) can significantly affect the performance of feature selection in TC. Second, the proposed GAFS method allows users to systematically construct a robust classifier for TC. Originality/value: This paper investigates the influence of the GA parameters on feature selection in TC. It advances the literature on choosing GA parameters and classification techniques to optimize TC performance.
Keywords: document categorization; genetic algorithm; feature selection; classifier
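
The abstract describes GAFS only at a high level, so the following is a minimal illustrative sketch of GA-based feature selection in general, not the authors' implementation. It assumes a binary chromosome with one bit per index term, tournament selection, one-point crossover, bit-flip mutation, and elitism. The fitness function is the validation accuracy of a simple nearest-centroid classifier, used here as a stand-in for the six classifiers in the paper. The toy data, the function names, and all parameter values are hypothetical. The all-ones chromosome (the TOTAL baseline) is seeded into the initial population so that, on the validation data, the selected subset is never worse than using every feature.

```python
import random

def make_toy_data(n_docs, n_features, rng):
    """Hypothetical toy data: features 0 and 1 carry the label, the rest are noise."""
    data = []
    for _ in range(n_docs):
        label = rng.randint(0, 1)
        x = [rng.random() for _ in range(n_features)]
        x[0] = label + rng.gauss(0, 0.3)
        x[1] = (1 - label) + rng.gauss(0, 0.3)
        data.append((x, label))
    return data

def nearest_centroid_accuracy(train, test, mask):
    """Accuracy of a nearest-centroid classifier using only features where mask is 1."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0  # an empty feature subset cannot classify anything
    sums, counts = {}, {}
    for x, y in train:
        s = sums.setdefault(y, [0.0] * len(idx))
        for j, i in enumerate(idx):
            s[j] += x[i]
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def dist(x, c):
        return sum((x[i] - c[j]) ** 2 for j, i in enumerate(idx))

    correct = sum(
        min(centroids, key=lambda y: dist(x, centroids[y])) == label
        for x, label in test
    )
    return correct / len(test)

def ga_feature_selection(train, valid, n_features, pop_size=20, generations=30,
                         crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a binary feature mask that maximizes validation accuracy."""
    rng = random.Random(seed)

    def fitness(mask):
        return nearest_centroid_accuracy(train, valid, mask)

    # Assumption: seed the population with the all-ones mask (TOTAL baseline).
    population = [[1] * n_features] + [
        [rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size - 1)
    ]
    best = max(population, key=fitness)
    for _ in range(generations):
        scored = [(fitness(m), m) for m in population]

        def tournament():
            a, b = rng.sample(scored, 2)
            return (a if a[0] >= b[0] else b)[1]

        next_pop = [best[:]]  # elitism: carry the best chromosome forward
        while len(next_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # bit-flip mutation
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            next_pop.append(child)
        population = next_pop
        best = max(population, key=fitness)
    return best, fitness(best)

rng = random.Random(1)
train = make_toy_data(60, 12, rng)
valid = make_toy_data(40, 12, rng)
best_mask, best_acc = ga_feature_selection(train, valid, n_features=12)
total_acc = nearest_centroid_accuracy(train, valid, [1] * 12)
print("selected features:", [i for i, bit in enumerate(best_mask) if bit])
print("GA subset accuracy:", best_acc, "| all features:", total_acc)
```

Because the fitness is measured on the same validation split that guides the search, the reported accuracy is optimistic. A fair comparison against the all-features baseline would use a separate held-out test set.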
