Knowledge discovery from large collections of unstructured information
The number of scientific publications is increasing by 3% per year, making it difficult for scientists to keep up with new research and to find relevant papers. As a result, time they could spend on research is used to stay up to date with the state of the art in their fields. Wikipedia increasingly serves as a reliable source for up-to-date scientific knowledge. This fact is the main motivation for developing a knowledge discovery support system capable of retrieving relevant documents on a selected subject. This thesis focuses on state-of-the-art solutions for supporting knowledge discovery from Wikipedia articles. While the majority of similar tools target biomedical publications, this thesis focuses on Wikipedia pages related to marine science. The four main components of the proposed system are: document retrieval, document filtration, information extraction and interactive search. The retrieval component extracts the core page content from the Wikipedia pages and removes all wiki markup, leaving only the plain text of the article. The document filtration component selects those articles related to marine science by using a machine learning solution called topic modelling, which uses statistical methods for finding topically similar documents. The information extraction component performs named entity recognition for geographical locations and marine species. In addition, it detects events involving variables that are increasing, decreasing or changing. Topic models were evaluated against a gold standard that was created with the help of domain experts using a pooling method. The evaluation of document filtering indicated that the best performing topic model is Latent Semantic Analysis (LSA) configured with 500 topics, yielding an NDCG score of 0.70. This model was subsequently used to retrieve 4727 marine science related articles from an initial list of 22 seed articles. The search interface is a single-page web application which provides faceted search.
The Solr-based implementation allows retrieving documents by search terms and filtering them by facet, including location, species and changing variables. A visualisation method displays the extracted geographical locations as pinpoints on a map and presents changing variables in the text with colours indicating the direction of change.
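
The NDCG score reported above for the topic model evaluation can be illustrated with a minimal sketch. The abstract does not state which NDCG variant or relevance scale was used, so the following assumes the common formulation with linear gain and a log2 rank discount; the function names and the example relevance grades are hypothetical and are not taken from the thesis.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each grade is divided by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, k=None):
    """NDCG of a ranked list of graded relevance judgments.

    `relevances` holds the gold-standard grade of each retrieved document,
    in the order the model ranked them. `k` optionally truncates the ranking.
    Assumption: linear gain and log2 discount; the thesis may use a variant.
    """
    ranked = relevances if k is None else relevances[:k]
    # The ideal ranking places the most relevant documents first.
    ideal = sorted(relevances, reverse=True)
    ideal = ideal if k is None else ideal[:k]
    ideal_dcg = dcg(ideal)
    # With no relevant documents at all, define the score as 0.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical grades (0 = irrelevant, 2 = highly relevant) for one ranking.
    print(round(ndcg([2, 0, 1, 2, 0]), 3))
```

A score of 1.0 means the model ranked documents exactly in order of their gold-standard relevance, and lower values indicate that relevant articles appear further down the ranking.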