Enhancing hierarchical clustering with local search
MetadataVis full innførsel
Data Clustering is defined as grouping together objects which share similar properties. These properties can be anything as long as it is possible to measure and compare them. Clustering can be an important tool in many different settings varying from medical use to data mining. In this work we distinguish between two different types of clustering. The simplest one, called partitional clustering, tries to create one solution by comparing objects and partitioning them into non-overlapping groups. The second one, called hierarchical clustering, is a bit more complex and will try to generate a multi-layered solution instead. In this multi-layered structure, groups actually consist of more than one sub-group from a lower layer. The structure is often and probably best thought of as a tree-structure. The only way to get, and ensure, a “perfect” solution is to use a brute force clustering algorithm which works by comparing all possible solutions against each other. We do not implement a brute force method; instead we use another algorithm which generates very high quality solutions. The goal of this study was to develop an algorithm for hierarchical clustering with the ability to solve large problems quicker than the alternative existing solution. The approach to this was to make the algorithm work with only parts of the problem area at any time. In this report this is called:”local search”. Some other important concepts also explained and used in this report are local/global optimum, online/offline clustering and partitional/hierarchical clustering. The already existing algorithm briefly mentioned serves as a “template” which we use to compare and test our developments against. This testing has been performed using an application we have developed specifically for this purpose. The results from our experiments chapter hints that the approach we have taken when developing the algorithms is very much on the right track. Our algorithms perform high quality clustering within decent time. Compared to the “state of the art” algorithm we have tested them against they provide, on large problems, somewhat less quality solutions, but at a much better time. That being said we still think there may be room for improvements in any future work.