Online News Detection on Twitter
In this thesis, we seek suitable methods for detecting news on Twitter, drawing on the fields of artificial intelligence, information retrieval, computational linguistics, and natural language processing. We combine these fields to find newsworthy tweets, cluster them based on time and similarity, and find the most representative tweet for a news event. We compare different topic modeling methods for finding news topics and the tweets related to them in an online setting. We then examine an online clustering approach that detects news as tweets arrive, clustering them by time of arrival and similarity of content.

One of the greatest challenges is to find the tweets that can be characterized as news in the ocean of tweets. Many tweets are about personal matters or two-way communication among friends. We call these uninteresting tweets chatter. Many tweets also contain abbreviations and misspellings, and lack proper sentence structure. This makes it difficult for otherwise good natural language processing systems to evaluate the content and language of tweets.

Our study shows that finding news using topic modeling is difficult. Training a proper model is time-consuming, and even when a working model is obtained, it is unclear how to use it effectively to detect news.

While developing the online news detection system for Twitter, we found that the clustering approach elaborated on in this thesis works well for tweets. The system clusters similar tweets based on time and content, and it performs well doing so. Owing to information entropy and the tuning of parameters in the clustering algorithm, we were able to achieve higher precision than the baseline clustering algorithm.

Lastly, we found that the task of finding the most representative tweet for a news event is simple when the tweets have been clustered well, that is, when the clusters contain mostly news-relevant tweets.
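
To make the idea of clustering tweets by time of arrival and content similarity more concrete, the following is a minimal sketch and not the algorithm developed in the thesis. It assumes a bag-of-words representation, cosine similarity, a fixed similarity threshold, and a maximum cluster age. The names OnlineClusterer, sim_threshold, max_age, and representative are hypothetical, and the entropy-based filtering mentioned above is not modeled. The most representative tweet is taken here to be the one closest to the cluster centroid, which is one plausible choice among several.

```python
import math
from collections import Counter


def vectorize(text):
    """Lowercased bag-of-words term counts (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineClusterer:
    """Hypothetical single-pass clusterer; thresholds are illustrative assumptions."""

    def __init__(self, sim_threshold=0.5, max_age=3600.0):
        self.sim_threshold = sim_threshold  # minimum similarity to join a cluster
        self.max_age = max_age              # seconds a cluster stays open after its last tweet
        self.clusters = []

    def add(self, timestamp, text):
        """Assign one incoming tweet to the best recent cluster, or open a new one."""
        vec = vectorize(text)
        best, best_sim = None, 0.0
        for cluster in self.clusters:
            # Time criterion: skip clusters that have gone quiet for too long.
            if timestamp - cluster["last_seen"] > self.max_age:
                continue
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        # Content criterion: open a new cluster if nothing is similar enough.
        if best is None or best_sim < self.sim_threshold:
            best = {"centroid": Counter(), "tweets": [], "last_seen": timestamp}
            self.clusters.append(best)
        best["centroid"].update(vec)  # summed counts; cosine ignores the scale
        best["tweets"].append((timestamp, text))
        best["last_seen"] = timestamp
        return best


def representative(cluster):
    """Return the text of the tweet most similar to the cluster centroid."""
    return max(
        cluster["tweets"],
        key=lambda item: cosine(vectorize(item[1]), cluster["centroid"]),
    )[1]
```

In this sketch, each tweet is passed to add() with its arrival time in seconds, and representative() can be called on any returned cluster. A realistic system would need better tokenization and a way to discard chatter before clustering.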