Journal of Information Management (資訊管理學報)

黃宇翔;潘柏璇;
Pages: 135-155
Date: 2008/07
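Abstract (translated from the Chinese): The Extensible Markup Language (XML) specification was developed by the World Wide Web Consortium (W3C) and became a W3C Recommendation in February 1998. XML has gradually become the new standard for exchanging information among different systems and databases on the web, and because of its structured nature, classifying large numbers of XML documents has become an important problem. Existing approaches to XML document classification include the Naïve Bayes algorithm, template recognition and image-processing segmentation techniques, part-of-speech tagging and rule-based techniques, and TFIDF. Because earlier studies rarely analyzed the content of the documents themselves, ambiguous documents or derived, related documents may be classified incorrectly. This study first uses the tree structure of a document to determine the importance level of each item, then applies the TFIDF method to obtain feature items, so that documents can be classified correctly by comparing them against the feature items of each category. During classification, important new terms in the document are also taken into account to raise classification accuracy. So that the classifier is not restricted to the existing feature items, this study also proposes a mechanism for adding important feature items, allowing the classifier to adapt to documents with a wide range of content. Finally, the proposed method is compared with an XML document classification method that also uses hierarchical properties, and the results show that it significantly improves classification accuracy.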
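Keywords (translated from the Chinese): Extensible Markup Language; tree structure; document classification; association information extraction; new terms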

An Approach of Classifying XML Documents with Tree-Like Structure and New-Term Usage


Abstract: The Extensible Markup Language (XML), devised by the W3C, is a widely accepted recommended specification. XML has gradually become a standard format for exchanging information among different systems and databases on the web. Because XML documents are structured, classifying the very large number of them has become an especially important problem in knowledge management. Various approaches to XML classification have been proposed, including Naïve Bayes, template recognition, image processing, part-of-speech tagging with rule-based techniques, and TFIDF. However, these approaches rarely analyze the contents of the documents themselves, and so sometimes produce incorrect classifications. In this paper, we use the tree-like structure of a document to determine the importance of each term, and apply the TFIDF calculation to obtain the feature terms of each document. Documents can then be classified by matching these feature terms against those of each category. New terms appearing in documents are also taken into account to improve classification accuracy. Finally, the proposed approach is compared with a similar approach that also exploits hierarchical structure, and the results show that it significantly improves classification accuracy.
Keywords: XML;tree-like structure;classification;association extraction;new-term;
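
The idea summarized in the abstract, weighting terms by their position in the XML tree and then selecting feature terms with TFIDF, can be illustrated with a minimal sketch. This is not the authors' algorithm: the depth-based weight `decay ** depth`, the function names, the top-k feature selection, and the overlap-based scoring are all assumptions made for illustration, and the paper's new-term handling is not modeled.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def weighted_term_counts(xml_text, decay=0.5):
    """Count terms in an XML document, weighting each by element depth.

    Assumption (not from the paper): a term found at depth d contributes
    decay ** d, so text nearer the root counts more. Only element text is
    read; tail text and attributes are ignored.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        for term in (node.text or "").lower().split():
            counts[term] += decay ** depth
        for child in node:
            stack.append((child, depth + 1))
    return counts


def tfidf(all_counts):
    """Turn a list of weighted term counts into TF-IDF scores per document."""
    n = len(all_counts)
    df = Counter()
    for counts in all_counts:
        df.update(counts.keys())  # each document adds 1 per distinct term
    return [
        {t: w * math.log(n / df[t]) for t, w in counts.items()}
        for counts in all_counts
    ]


def top_features(scores, k=2):
    """Hypothetical feature selection: keep the k highest-scoring terms."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def classify(xml_text, class_features, decay=0.5):
    """Pick the class whose feature terms carry the most weight in the document."""
    counts = weighted_term_counts(xml_text, decay)
    return max(
        class_features,
        key=lambda c: sum(counts[t] for t in class_features[c]),
    )


# Toy training data: one document per class (purely illustrative).
training = {
    "sports": "<doc><title>football match</title>"
              "<body><p>the team won the match</p></body></doc>",
    "finance": "<doc><title>stock market</title>"
               "<body><p>the market fell</p></body></doc>",
}
labels = list(training)
scores = tfidf([weighted_term_counts(training[c]) for c in labels])
class_features = {c: top_features(s) for c, s in zip(labels, scores)}
```

With these toy documents, terms in `<title>` outweigh the same terms deeper in `<body>`, and a term shared by both classes (here "the") receives a TFIDF score of zero and so is never chosen as a feature. Calling `classify("<doc><title>match report</title></doc>", class_features)` then assigns the new document to the class whose feature terms it contains.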
