Stochastic Learning-Based Estimation Methods for Pattern Recognition and Its Application to Topic Detection and Tracking
Every Pattern Recognition (PR) problem involves a training phase and a testing phase. In the training phase, the system is presented with samples from which the distribution of the features (also called the class-conditional distribution) is estimated. Traditional PR systems assume that the class-conditional distributions are stationary, that is, that they do not change with time. Recently, Oommen and his co-authors presented a strategy, the Stochastic Learning Weak Estimator (SLWE), by which the parameters of a binomial/multinomial distribution can be estimated when the distribution is non-stationary. In this thesis, we propose a selection of performance indexes that take into account crucial characteristics of non-stationary environments. Furthermore, we use the proposed indexes to perform a more extensive empirical evaluation of the SLWE, and compare it with traditional estimation algorithms operating in non-stationary environments. The purpose is to bring forward the unique strengths and weaknesses of the competing approaches.
This thesis also considers the design and implementation of PR systems dealing with such non-stationary environments. In particular, we concentrate on the application domain of language classification in multilingual Word of Mouth discussions. One novel feature of our method, unlike traditional PR systems, is that training is achieved by learning the N-gram characteristics of every language. Testing, however, invokes the SLWE, because the sample documents being classified contain parts written in different languages, interspersed with each other, without the user knowing where one language stops and the next starts. Our empirical testing demonstrates that our proposed method is capable of classifying multilingual documents with high overall accuracy. We show that our method scales well with regard to the dimensionality of the feature space, and that it is resistant to textual errors in the testing data.
Finally, and more importantly, the classifier performs extremely well when classifying segments of moderate size (15-20 words), with a reported overall classifier accuracy of 0.989, and adequately for shorter segments (10 words per segment), yielding an accuracy of 0.9596. Thus, we believe that our results provide additional insight into the performance of the SLWE and the MLE when operating in non-stationary environments. Furthermore, it is our opinion that our proposed technique for language classification will be of benefit in applications dealing with Pattern Recognition in multilingual text documents.
Master's thesis in Information and Communication Technology, 2008 – Universitetet i Agder (University of Agder), Grimstad