Incremental web crawling as a competitive game of learning automata
MetadataVis full innførsel
There is no doubt that the World Wide Web has lived up to it’s hype of being the world’s central information highway through the past years. An increasing amount of versatile services keeps finding their way onto the Web as information providers continue to embrace the possibilities that the Web can offer. Especially the possibility of producing dynamic content has been an accelerant factor and is the reason why we now conveniently can participate in online auctions or see the latest development of our favorite stocks in near real-time from our own living rooms. However, for automated data mining applications that deploy crawlers to continuously capture the information provided by this new breed of services, the highly dynamic nature of the content is not convenient at all. As a matter of fact, a complete new set of challenges emerges where traditional crawling strategies are shown to be sub-optimal. Accordingly a new class of methods for crawling operations are clearly needed. Nonetheless, the problem area has so far been given limited attention in literature. In this thesis we address the new problem area of monitoring highly dynamic data sources of different importance. We use the concept of an incremental web crawler as a basis for our novel approach where we consider the incremental crawling task as a continuous learning problem where scheduling of monitoring tasks is combined with parameter estimation in an on-line manner. By mapping the problem to two variants of the so called knapsack problem we propose two solutions based on a machine learning technique known as learning automata. We show empirically that our proposed solutions continuously improve their performance through a learning process and that they are capable of operating in non-stationary environments. We also show their performance in comparison to alternative algorithms where, most notably, our schemes are shown to outdo the traditional uniform crawling scheme by factors up to 550% in certain situations.
Masteroppgave i informasjons- og kommunikasjonsteknologi 2005 - Høgskolen i Agder, Grimstad
UtgiverHøgskolen i Agder
Agder University College