Evaluation of alternative association measures for extraction of terminology based on a large Norwegian corpus
Original version: SYNAPS - A Journal of Professional Communication 26 (2013)
Multiword expressions are words that co-occur so often that they are perceived as a linguistic unit (Stubbs 2007). Identifying them correctly is important for a variety of tasks within terminology, lexicography and language technology. This paper presents a methodology for the systematic and corpus-driven study of multiword expressions in Norwegian. It reports on a series of experiments that use a variety of association measures to identify multiword expressions in a large corpus of Norwegian newspapers (Andersen & Hofland forthcoming). The output of each association measure is a ranked list of bigrams and trigrams in the corpus. The value of the different association measures for terminology purposes is assessed by considering the relevance and salience of the ranked candidates among the bigrams and trigrams in the data. It is shown that the association measures differ greatly in their ability to pick out relevant term candidates. The paper also briefly evaluates the corpus itself and its relevance for terminology work (Kristiansen forthcoming).
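To make the idea of "a ranked list of bigrams" concrete, the following is a minimal sketch of how an association measure can score and rank bigram candidates. The abstract does not name the measures used in the experiments; pointwise mutual information (PMI) and the Dice coefficient are assumed here only because they are common examples. The function names, the frequency threshold and the toy token list are hypothetical and are not taken from the paper or its corpus.

```python
import math
from collections import Counter

def pmi(f_xy, f_x, f_y, n):
    # Pointwise mutual information: log2 of observed over expected co-occurrence.
    return math.log2((f_xy * n) / (f_x * f_y))

def dice(f_xy, f_x, f_y, n):
    # Dice coefficient: share of the two words' occurrences that are joint.
    # n is unused but kept so both measures have the same signature.
    return 2 * f_xy / (f_x + f_y)

def rank_bigrams(tokens, measure, min_freq=2):
    # Count single words and adjacent word pairs.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    # Score every bigram that meets the (assumed) frequency threshold.
    scored = [
        ((x, y), measure(f, unigrams[x], unigrams[y], n))
        for (x, y), f in bigrams.items()
        if f >= min_freq
    ]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored

# Invented toy input, not data from the Norwegian newspaper corpus.
tokens = ("norges bank hever renten i dag og norges bank sier at "
          "renten i norge er lav i dag").split()

if __name__ == "__main__":
    for name, measure in (("PMI", pmi), ("Dice", dice)):
        print(name)
        for pair, score in rank_bigrams(tokens, measure):
            print("  ", " ".join(pair), round(score, 3))
```

On a toy input like this the two measures produce the same order; on a large corpus, measures typically diverge, for example in how strongly they favour rare pairs, which is the kind of difference the paper sets out to evaluate. Trigrams could be handled analogously by counting three-word windows, though how a measure generalises to trigrams is a design choice this sketch leaves open.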