
Pages: 27-45
Date: 2008/04
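Abstract: E-mail is one of the media people most commonly use to receive information today. However, many people exploit its convenience, speed, and low cost to send e-mail in bulk for advertising purposes. As a result, users' mailboxes fill with large volumes of spam sent without their consent. Solving the spam problem is therefore an important and urgent issue. The purpose of this study is to propose a suitable classification method that improves spam-filtering performance. The proposed method, the Incremental Clustering-Based Classifier (ICBC), is a classification algorithm built on clustering: ICBC divides documents into several clusters and identifies an equal number of representative features for each cluster, which addresses the skewed class distribution of spam data. ICBC is also capable of incremental learning, so it can adapt to a changing environment at lower cost and higher speed, which addresses topic drift in e-mail and avoids the cost of retraining on all data each time. Four experiments were conducted to assess the performance of the resulting classifier. The results show that ICBC can effectively filter both Chinese and English spam and can also overcome the problems of data skew and topic drift. These results confirm the applicability of ICBC.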
Keywords: spam filtering; data skew; topic drift; incremental learning

An Incremental Cluster-Based Classification Approach to Filtering Spam with Skewed Classes and Drifting Concepts

Abstract: E-mail has become one of the most popular channels for disseminating information. However, because it is convenient, fast, and cheap, some people abuse it to spread advertising and promotional material, which makes mailboxes harder for users to manage. These unsolicited and undesired e-mails are referred to as spam (or junk e-mail). Spam filtering is therefore essential for helping users get rid of unwanted e-mail. The purpose of this research is to propose an appropriate classification approach to filtering spam with skewed classes and drifting concepts. An Incremental Cluster-Based Classification method (ICBC) is proposed accordingly. ICBC first clusters documents into several groups and then extracts an equal number of keywords from each group to alleviate the problem of skewed class distributions. ICBC is also capable of incremental learning, so it can adapt to a changing environment with drifting concepts and avoid the cost of re-training. Four experiments are conducted to evaluate ICBC. The results show that ICBC can effectively classify both Chinese and English spam and can also deal with skewed class distributions and drifting concepts. The feasibility of ICBC is thus demonstrated.
Keywords: spam filtering; skewed class distribution; concept drifting; incremental learning
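
The abstract outlines three ideas: documents are grouped into clusters, every cluster is described by the same number of keywords, and the model is updated one message at a time instead of being retrained. A minimal Python sketch of those ideas follows. It is not the authors' ICBC algorithm: the abstract does not specify the clustering method, the feature-selection measure, or any parameter values, so the cosine-similarity rule, the frequency-based keyword choice, the whitespace tokenizer, and every name and default below (`IncrementalClusterClassifier`, `threshold`, `keywords_per_cluster`) are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count mappings."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Cluster:
    """A group of same-label messages summarised by term counts."""

    def __init__(self, label):
        self.label = label
        self.counts = Counter()

    def add(self, tokens):
        self.counts.update(tokens)

    def profile(self, k):
        # Every cluster is represented by at most k keywords, so a large
        # class cannot dominate simply by contributing more features.
        # Assumption: "representative" means "most frequent" here.
        return dict(self.counts.most_common(k))


class IncrementalClusterClassifier:
    """Hypothetical illustration only; not the published ICBC method."""

    def __init__(self, threshold=0.3, keywords_per_cluster=5):
        self.threshold = threshold      # assumed similarity cut-off
        self.k = keywords_per_cluster   # assumed keyword budget
        self.clusters = []

    @staticmethod
    def tokenize(text):
        # Assumption: whitespace tokens; Chinese text would need a
        # word segmenter instead.
        return text.lower().split()

    def learn(self, text, label):
        """Update the model with one labelled message, no retraining."""
        tokens = self.tokenize(text)
        doc = Counter(tokens)
        best, best_sim = None, 0.0
        for cluster in self.clusters:
            if cluster.label != label:
                continue
            sim = cosine(doc, cluster.profile(self.k))
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim < self.threshold:
            # Nothing similar enough: open a new cluster, which lets the
            # model follow topics that drift over time.
            best = Cluster(label)
            self.clusters.append(best)
        best.add(tokens)

    def predict(self, text):
        """Return the label of the most similar cluster, or None."""
        if not self.clusters:
            return None
        doc = Counter(self.tokenize(text))
        nearest = max(self.clusters,
                      key=lambda c: cosine(doc, c.profile(self.k)))
        return nearest.label
```

In this sketch, calling `learn` on each newly labelled message either reinforces an existing cluster or opens a new one, and `predict` assigns the label of the nearest cluster profile.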

