Distant Supervision and Sentiment Embeddings for Ternary Twitter Sentiment Analysis
Tang et al. (2014) acknowledged the inability of context-based word embeddings to discriminate between words with opposite sentiments that appear in similar contexts. An example is the words good and bad, two opposites that appear in the same contexts. Context-based word embedding methods like word2vec would likely treat these as similar words. Tang et al. proposed a promising method for incorporating sentiment information in word embeddings. These embeddings are called Sentiment-Specific Word Embeddings or Sentiment Embeddings.

To train sentiment embeddings, large amounts of sentiment-annotated data are needed. Manual annotation is too expensive for this purpose. Instead, fast automatic annotation is used to assign low-quality (weak) labels to large corpora of tweets. This procedure is often referred to as distant supervision. The traditional approach is to use the occurrence of emoticons to guess binary sentiment (positive/negative).

In this thesis, we compare various lexicon-based sentiment classifiers against each other on manually annotated Twitter data from the International Workshop on Semantic Evaluation (SemEval). Their performance as distant supervision methods is tested as part of a complete Twitter Sentiment Analysis system. Instead of only looking at the positive and negative sentiment classes, the neutral class is included. Both prediction performance and speed of the distant supervision methods are evaluated.

We propose the Ternary Sentiment Embedding Model, a new model for creating sentiment embeddings for the ternary sentiment classification task. It is based on the Hybrid Ranking Model of Tang et al. (2016), but trains on ternary-labeled distant-supervised data instead of binary-labeled data. The model trains sentiment embeddings from datasets made with different distant supervision methods. The model is used as part of a complete Twitter Sentiment Analysis system and is compared to existing systems.

The experiments of Chapter 8 show that the Ternary Sentiment Embedding Model performs better than the Hybrid Ranking Model of Tang et al. (2016) in most cases. Our results show that the quality of the distant-supervised dataset has a great impact on the quality of the produced sentiment embeddings, and hence on the entire Twitter Sentiment Analysis system.
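
To make the idea of distant supervision concrete, the following Python sketch contrasts an emoticon-based binary labeler with a lexicon-based ternary labeler. It is only an illustration under stated assumptions: the emoticon sets, the toy lexicon, the threshold value, and the function names are hypothetical and are not the classifiers or resources evaluated in the thesis.

```python
import re

# Hypothetical resources for illustration only; not taken from the thesis.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}
TOY_LEXICON = {
    "good": 1.0, "great": 1.5, "love": 2.0,
    "bad": -1.0, "awful": -2.0, "hate": -2.0,
}


def emoticon_label(tweet):
    """Binary weak label from emoticons.

    Returns None when the tweet has no emoticon or has conflicting ones,
    so such tweets would simply be dropped from the weakly labeled corpus.
    """
    tokens = tweet.split()
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos == has_neg:
        return None
    return "positive" if has_pos else "negative"


def lexicon_label(tweet, threshold=0.5):
    """Ternary weak label from summed lexicon scores.

    Scores at or beyond the (assumed) threshold become positive or
    negative; everything in between is labeled neutral.
    """
    words = re.findall(r"[a-z']+", tweet.lower())
    score = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ["great game :)", "I love this phone", "The bus leaves at noon"]:
        print(text, "|", emoticon_label(text), "|", lexicon_label(text))
```

The emoticon labeler cannot produce a neutral label at all and discards tweets without a clear emoticon signal, whereas the lexicon labeler assigns one of three classes to every tweet. This is one way to see why a ternary setting calls for distant supervision methods other than emoticons.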