Journal of Information Management (資訊管理學報)

許中川;陳景揆;
Pages: 103-122
Date: 2001/01
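Abstract: News reports cover the important events of each day, and large collections of news documents often contain important information. Text data mining techniques are used to discover features hidden in large document collections. However, current text mining research concentrates on documents in Western languages, and the keywords that represent a document are extracted manually. This study takes Chinese news documents as its mining target and attempts to discover the knowledge implicit in them. To exploit the special structure of news documents, keywords are collected by segmenting the Chinese text with a hybrid word segmentation method, followed by two steps: extraction of key existing terms and extraction of key new terms. These steps yield the keywords of each news document, which represent its important concepts and serve the subsequent mining. For data mining, concept hierarchy trees are first used to organize background knowledge and keywords, to suit the needs of knowledge mining in news documents. Then, building on association rules, we propose three modified association models: the first is new-term association rules, the second is associations between structured data and high-frequency terms, and the third is associations between structured data and terms of the same category. In addition, linear regression and the chi-square distribution are used to mine, respectively, the reporting trends and the distribution of keywords. Finally, experiments verify the feasibility of this mining framework.
Keywords: text data mining;knowledge discovery;keyword extraction;association rules;trend analysis;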

Data Mining in Chinese News Articles


Abstract: News articles report important daily events, and implicit information is hidden in huge collections of them. Text data mining technology aims at discovering knowledge hidden in large collections of texts. However, currently reported research focuses on English texts, and keywords are supplied manually. This paper studies text data mining in Chinese news articles. Utilizing the special structure of news articles, existing keywords and new keywords, which represent the content of a news article, are automatically extracted using a hybrid segmentation technique. The mining process then proceeds, guided by domain knowledge. We propose three types of extended association rules: new-keyword association rules, association rules between structured data and high-frequency keywords, and association rules between structured data and homogeneous keywords. Further, linear regression and the Chi-square test are used to analyze the reporting trend of keywords and the distribution of important concepts. Experiments are conducted to verify the feasibility of the proposed architecture.
Keywords: text data mining;knowledge discovery;keyword extraction;association rules;trend analysis;
