Knowledge Base Acceleration Using Features Inspired by Collaborative Filtering
Most of us use knowledge bases every day, whether it is Wikipedia, Netflix, an online newspaper or a dictionary. These knowledge bases contain both timeless and up-to-date information and must be updated continuously. A knowledge base is often updated with content from channels that carry large amounts of information, and a human editor updating it can find it difficult to distinguish important signals from noise. Automatically identifying central documents in a stream of documents, with the aim of expanding a knowledge base, is called Knowledge Base Acceleration (KBA). The task of separating out the central documents is called Cumulative Citation Recommendation (CCR). Through feature extraction, document classification and ranking, we can find documents that will be central to a knowledge base. It is then up to a human editor to select which documents are to be included in the knowledge base. Our work in this thesis has two parts. First, we want to verify earlier work in this field by applying similar principles to a new dataset. The second part of the thesis focuses on expanding the dataset with new features based on principles from Collaborative Filtering (CF). The model that solves the first part of the thesis achieves an F1 value of 0.853. This is twice as high as that of the reference model, which determines whether a Twitter message is central based only on its number of likes. Furthermore, we expand the first model with new features inspired by CF. The new model is significantly better, improving on the first by 1 percentage point. The thesis' main contribution to the KBA field is the insight that taking into account the relationships between authors of incoming documents can improve a model's ability to identify central documents. We have also verified that relevant principles from KBA hold by applying them to an entirely new dataset. The results of this work are very satisfactory and have been statistically tested.
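As an illustration of how a like-count reference model and the F1 measure fit together, the following minimal Python sketch may help. It is not the thesis implementation: the function names, the threshold value and the toy data are all assumptions made for this example, and the numbers have no connection to the results reported above.

```python
from typing import List, Sequence


def like_count_baseline(likes: Sequence[int], threshold: int) -> List[int]:
    """Label a message as central (1) if it has at least `threshold` likes.

    Hypothetical stand-in for a reference model that looks only at likes.
    """
    return [1 if n >= threshold else 0 for n in likes]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy data: like counts and hand-assigned "central" labels.
likes = [0, 3, 12, 40, 1, 25]
central = [0, 1, 1, 1, 0, 0]

predicted = like_count_baseline(likes, threshold=10)  # assumed threshold
score = f1_score(central, predicted)
print(f"baseline F1 on toy data: {score:.3f}")
```

A classifier trained on richer features would be scored with the same `f1_score` function, which is what makes a comparison against such a baseline meaningful.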