Journal of Information Management (資訊管理學報)

楊燕珠;陳志豐;
Pages: 165-184
Date: 2009/01
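Abstract: With the spread of the Internet, more and more users search online for material to read. This study aims to cluster a large collection of documents by topic so that users can quickly see which topics the collection contains and promptly select the documents on the topics they need. The study combines the frequent itemsets of association rules with approximate pattern matching to mine "approximate frequent patterns" as document features, and incorporates the distance (similarity) of the approximate match into the feature weights. In addition, it proposes a "two-phase density- and similarity-based clustering algorithm", which does not require the number of clusters to be set in advance and is suited to clustering large numbers of documents. Experimental results show that the number of "approximate frequent pattern" features is 1.42 times that of flexible word pairs and 0.84 times that of single terms. Clustering with these features gives higher average recall, precision, and accuracy than clustering with flexible word pairs, adjacent word pairs, or single terms, showing that "approximate frequent patterns" do extract more features that are meaningful and discriminative. Combined with the proposed clustering algorithm, the approach speeds up clustering, makes an appropriate number of clusters easy to determine, and improves the quality and accuracy of document clustering.
Keywords: Frequent Itemset; Pattern Matching; Feature Extraction; Document Clustering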

Document Clustering Based on Frequent Itemset Integrated with Approximate Pattern Matching


Abstract: As the Internet has become widespread, more and more users search online for material to read. This research aims to group a large collection of documents by topic so that users can quickly see which topics the collection covers and pick the documents on the topics that interest them. To extract more meaningful features, we propose an approach that integrates frequent itemsets with approximate pattern matching to mine "Approximate Frequent Patterns". The distance (similarity) of the approximate match is incorporated into the feature weights, unlike the traditional support count (frequency) of itemsets. In addition, we present a "Two-Phase Density and Similarity-Based Clustering Algorithm". This method does not require the number of clusters to be set in advance, which makes it suitable for thematic document clustering. The experimental results show that the number of "Approximate Frequent Patterns" is 1.42 times that of flexible word pairs and 0.84 times that of single terms. Clustering with these features yields higher average recall, precision, and accuracy than clustering with flexible word pairs, bigrams, or single words. This shows that "Approximate Frequent Patterns" extract more meaningful and discriminative features. Moreover, the proposed clustering algorithm speeds up clustering, makes it easier to determine an appropriate number of clusters, and improves the quality and accuracy of document clustering.
Keywords: Frequent Itemset; Pattern Matching; Feature Extraction; Document Clustering
