Automatic data extraction from online discussion boards
Many papers have been written in the field of mining data from structured web pages. However, few if any of them focus on retrieving specific parts of discussion board postings. A discussion board page contains a set of postings, which can be considered data records. Our goal is to provide insight into a specific approach for identifying the locations of author, content and date+time, which are parts of a complete discussion board posting data record. Our approach combines a Naive Bayes pattern classifier, structure classification and grammar to identify the sought-after elements. We give a thorough evaluation of our Naive Bayes classifier and its components, and of how combinations of the different parts of our approach affected the overall result. Our best results for identifying the location of the individual elements were 94% for author, 76% for content and 86% for date+time, and 60% for getting every element of each post correct. While the result for complete posts is not very good, it depends heavily on the other results. We believe our approach shows promise and that, with further development and refinement, it will be a viable method for automatic extraction of data from online discussion boards.
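The dependence of the complete-post figure on the per-element figures can be illustrated with a small calculation. The sketch below assumes that errors on the three elements are independent, which the abstract does not claim; under that assumption the chance of getting a whole post right is the product of the three per-element rates. The function name and dictionary keys are hypothetical and chosen only for this illustration.

```python
# Best per-element location accuracies reported in the abstract.
accuracies = {"author": 0.94, "content": 0.76, "date_time": 0.86}


def joint_accuracy_if_independent(rates):
    """Multiply the per-element rates together.

    Assumption (ours, not the thesis's): errors on the individual
    elements are statistically independent of each other.
    """
    result = 1.0
    for rate in rates.values():
        result *= rate
    return result


estimate = joint_accuracy_if_independent(accuracies)
print(f"{estimate:.3f}")  # prints 0.614
```

Under this simplifying assumption the estimate is roughly 61%, close to the reported 60% for complete posts, so the complete-post result is largely bounded by the weakest element, content.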
Master's thesis in information and communication technology, 2009 – Universitetet i Agder, Grimstad