Journal of Information Management (資訊管理學報)

翁慈宗;劉冠良;韓昀達;
Pages: 87-115
Date: 2015/01
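Summary: With the recent development of sequencing technology, biologists no longer study ecological environments by traditional means. Instead, they collect microbial samples from the environment and use sequencing to obtain species information and explore species diversity. In rRNA sequence classification, features are extracted from gene sequence data with an N-mer sliding window. Adjacent extracted features share N-1 characters, so the extracted feature set is correlated, which violates the conditional independence assumption of the naïve Bayesian classifier. This study applies the Markov naïve Bayesian classifier to gene sequence data, a high-dimensional classification problem with heavy computational demands, not only because of the classifier's computational efficiency but also because incorporating a Markov model mitigates the conditional independence problem of the naïve Bayesian classifier. The study adopts the multinomial model as the probability model; because this model takes feature frequencies into account when computing likelihoods, it yields better classification performance. In addition, the study introduces a prior distribution, the Dirichlet distribution, expecting that combining the Markov naïve Bayesian classifier with Dirichlet priors, through the setting of two kinds of prior parameters (numerator prior parameters and denominator prior parameters), will improve classification accuracy. Two approaches, Dirichlet_numerator-denominator and Dirichlet_denominator-numerator, are tested on four gene sequence data sets. The empirical results show that Dirichlet_numerator-denominator, which within each class value tunes the numerator parameters first and then the denominator parameters, gives the better classification results. After parameter tuning, both approaches achieve higher classification accuracy than the RDP classifier. Compared with the naïve Bayesian classifier combined with Dirichlet priors, they have additional denominator prior parameters available for tuning and therefore attain higher classification accuracy.
Keywords: Dirichlet distribution; Markov model; naïve Bayesian classifier; nucleotide gene sequence;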
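
The overlap between adjacent N-mer features noted in the abstract above can be illustrated with a minimal Python sketch. The function name `extract_nmers` and the example sequence are hypothetical and are not taken from the paper; the sketch only shows why neighbouring features cannot be independent.

```python
def extract_nmers(sequence: str, n: int) -> list[str]:
    """Slide a window of width n across the sequence, one base at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# Hypothetical toy sequence, for illustration only.
features = extract_nmers("ACGTAC", 3)

# Each feature shares its last n-1 bases with the first n-1 bases of the next,
# so knowing one feature constrains what the next one can be.
shares_overlap = [a[1:] == b[:-1] for a, b in zip(features, features[1:])]
```

A sequence shorter than `n` yields an empty feature list in this sketch.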

Dirichlet Priors for Markov Naïve Bayesian Classifiers with Multinomial Model for Gene Sequence Data


Abstract:
Purpose - The RDP classifier is computationally efficient and does not require sequence alignment. It also works well with short sequences and fills a unique niche for applications using NGS technologies that generate millions of short sequences. Its performance, however, is hampered by the conditional independence assumption on features. The dependency is especially evident when the k-mer method extracts attributes from a sequence with a sliding window, because each attribute overlaps its preceding and following attributes by k-1 bases.
Design/methodology/approach - This study developed a multinomial Markov-based Bayesian classifier that remedies the unrealistic independence assumption with a Markov model. To prevent the probability estimate of a feature from becoming zero and distorting the classification result, the Laplace estimate is usually used as a prior for all features. However, this setting assumes a fixed confidence level for all features. In this study, we further develop a noninformative generalized distribution for prior setting that allows different confidence levels for different features.
Findings - The experimental results on bacterial 16S and fungal 28S rRNA gene sequence sets show that the proposed model achieves higher prediction accuracy than the well-known RDP classifier at all ranks. Because the Markov naïve Bayesian classifier has two priors per class value rather than the single prior of the naïve Bayesian classifier, the best noninformative Dirichlet priors do enhance its performance.
Research limitations/implications - The study proposes to model DNA sequences as a kth-order Markov chain on the alphabet {A, C, G, T}; that is, the probability of observing a particular symbol depends only on the k previous symbols. Because under this model the probability of a read is written as a ratio of products of the probabilities of overlapping k-mers, the model introduces additional computational overhead relative to the current implementation of the RDP classifier. In practice, however, this overhead is a one-time calculation during training.
Practical implications - Now that high-throughput sequencing technologies can yield thousands of rRNA sequences from environmental and Human Microbiome Project samples, accurate sequence classification is a critical component of the ecological interpretation of environmental datasets. The approach used in this article to evaluate bacterial and fungal sequences proves to be a valuable tool for determining the most important factors affecting classification accuracy.
Originality/value - This study develops a Markov Bayesian classifier that remedies the strong conditional independence assumption of the naïve Bayesian classifier with a Markov assumption. Because the k-mer method uses an overlapping sliding window to extract features from raw sequence data, the conditional independence assumption on features is clearly violated. By contrast, the Markov model, which assumes that each feature depends on the preceding features, fits the k-mer extraction method perfectly.
Keywords: Dirichlet distribution;Markov model;naïve Bayesian classifier;rRNA gene sequence;
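
As a rough illustration of the "ratio of products" form mentioned in the abstract, the following Python sketch scores a read against one class by summing log-probabilities of its n-mers and subtracting log-probabilities of the (n-1)-mer overlaps between consecutive n-mers. The symmetric pseudo-counts `num_prior` and `den_prior` stand in for the numerator and denominator prior parameters. The abstract does not give the exact estimator or tuning procedure, so the smoothing formula, the function names, the toy sequences, and the equal class priors are all assumptions made for this sketch.

```python
import math
from collections import Counter

ALPHABET_SIZE = 4  # A, C, G, T


def nmers(seq, n):
    """All overlapping substrings of length n, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def train_class(seqs, n):
    """Count n-mers and (n-1)-mers over one class's training sequences."""
    word_counts, overlap_counts = Counter(), Counter()
    for s in seqs:
        word_counts.update(nmers(s, n))
        overlap_counts.update(nmers(s, n - 1))
    return word_counts, overlap_counts


def log_score(read, n, word_counts, overlap_counts, num_prior=1.0, den_prior=1.0):
    """Log of (product of n-mer probabilities) / (product of overlap probabilities).

    Assumed estimator: counts smoothed by symmetric pseudo-counts, which must
    be positive so that unseen n-mers never produce log(0).
    """
    word_total = sum(word_counts.values())
    overlap_total = sum(overlap_counts.values())
    words = nmers(read, n)
    score = 0.0
    for w in words:
        p = (word_counts[w] + num_prior) / (word_total + num_prior * ALPHABET_SIZE ** n)
        score += math.log(p)
    # The overlap between consecutive n-mers is the last n-1 bases of the earlier one.
    for w in words[:-1]:
        q = (overlap_counts[w[1:]] + den_prior) / (
            overlap_total + den_prior * ALPHABET_SIZE ** (n - 1)
        )
        score -= math.log(q)
    return score


# Hypothetical toy training data for two classes (equal class priors assumed).
class_a = train_class(["ACGTACGT", "ACGTACGA"], 3)
class_b = train_class(["TTTTGGGG", "TTGGTTGG"], 3)
read = "ACGTAC"
score_a = log_score(read, 3, *class_a)
score_b = log_score(read, 3, *class_b)
predicted = "A" if score_a > score_b else "B"
```

In this sketch the counting happens once per class in `train_class`, which is consistent with the abstract's remark that the extra overhead falls in training. Varying `num_prior` and `den_prior` separately corresponds loosely to having two prior parameters per class value.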
