A novel spatio-temporal scheme for reducing the rate of false positives in bloom filter based URL-caching
Efficient use of available resources is an important problem in the field of web mining. Monitoring and analyzing the web is extremely resource demanding, so more efficient use of resources often translates directly into improved web monitoring coverage and accuracy. One important subproblem is reducing the memory consumption of the URL cache in a web crawler system. Using the space-efficient Bloom filter data structure as the URL cache reduces memory consumption. However, the Bloom filter introduces false positives, which lead to the loss of valuable web content when the filter is used as a URL cache in a web crawler system. To address this problem, this thesis proposes three novel strategies, namely a temporal, a spatial and a spatio-temporal strategy, each aiming to reduce the false positive rate introduced by the Bloom filter. Testing and evaluation showed that both the spatial and the temporal strategy are able to reduce the false positive rate of the Bloom filter. The two strategies were then combined to test whether the false positive probability could be decreased further. Testing and evaluation of the combined strategy shows that it does yield a reduction in the false positive probability.
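To make the false positive problem concrete, the following is a minimal sketch of a textbook Bloom filter used as a URL cache. It does not implement the temporal, spatial or spatio-temporal strategies proposed in the thesis. The class name `BloomFilter`, the helper `measure_false_positive_rate`, the SHA-256 based double hashing and the `example.com` URLs are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter with m bits and k hash functions (illustrative)."""

    def __init__(self, m_bits: int, k_hashes: int) -> None:
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, url: str):
        # Assumption: derive k bit positions by double hashing one SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


def measure_false_positive_rate(m_bits: int, k_hashes: int,
                                n_seen: int, n_probe: int) -> float:
    """Insert n_seen URLs, then probe n_probe URLs that were never inserted."""
    cache = BloomFilter(m_bits, k_hashes)
    for i in range(n_seen):
        cache.add(f"http://example.com/seen/{i}")
    # None of the probe URLs were inserted, so every hit is a false positive.
    hits = sum(f"http://example.com/new/{i}" in cache for i in range(n_probe))
    return hits / n_probe


if __name__ == "__main__":
    for m_bits in (2_000, 20_000):
        rate = measure_false_positive_rate(m_bits, 3, 1_000, 2_000)
        print(f"m={m_bits} bits: measured false positive rate {rate:.4f}")
```

In a crawler, each false positive corresponds to a URL that has never been visited but is reported as already cached, so the page behind it is skipped. The sketch only illustrates the trade-off between filter size and false positive rate that motivates the thesis.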
Master's thesis in Information and Communication Technology, 2010 – Universitetet i Agder, Grimstad