Domain Adaptation for Part-of-Speech Tagging of Indonesian Text Using Affix Information

2021
Abstract Part-of-speech (POS) tagging is the process of assigning a word class to each word in a text. A POS tagger for a given language is usually built from a generic-domain corpus, for example newspaper text. When such a tagger is tested on words from a new or specialized domain, it may assign word classes inaccurately. Domain adaptation can be approached in several ways, such as using clustering to change the word representation, using a model with a large lexicon, or training the model on labelled text from the specific domain. In this research we apply a domain adaptation method that uses an additional lexicon built with affix rules. The target domain is beauty products. The system consists of a generic-domain POS tagger and an unlabelled lexicon from the target domain. Word classes in the target-domain lexicon are assigned based on affix information, and the remaining words are labelled manually. Observation of the dataset showed that English words are used frequently, so the lexicon was developed in both Indonesian and English. The processed lexicon is added to the lexicon of the original POS tagger to supply domain-specific information to the generic-domain tagger. The POS tags this study focuses on are noun, proper noun, adjective, and adverb, because the tagger's output is used for aspect and opinion extraction. The tagger with the added lexicon achieves 68.99% accuracy, and 92.36% of words are successfully recognized by the tagger.
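The lexicon-building step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not list the affix rules, so the four rules in `AFFIX_RULES`, the tag names, the `min_stem` threshold, and every function name here are assumptions chosen for the example. Letting domain entries override generic ones when the lexicons are merged is also an assumption; the abstract only says the processed lexicon is added to the original one.

```python
# Hypothetical, simplified affix rules (prefix, suffix, tag).
# The paper's actual rule set is not given in the abstract.
AFFIX_RULES = [
    ("ke", "an", "NOUN"),   # e.g. "kecantikan" (beauty)
    ("pe", "an", "NOUN"),   # e.g. "perawatan" (treatment)
    ("se", "nya", "ADV"),   # e.g. "secepatnya" (as soon as possible)
    ("ter", "", "ADJ"),     # e.g. "terbaik" (best)
]


def label_by_affix(word, min_stem=3):
    """Return a tag if the word matches an affix rule, else None."""
    w = word.lower()
    for prefix, suffix, tag in AFFIX_RULES:
        if w.startswith(prefix) and w.endswith(suffix):
            # Require a minimum stem length (an assumed safeguard)
            # so very short words do not match by accident.
            stem_len = len(w) - len(prefix) - len(suffix)
            if stem_len >= min_stem:
                return tag
    return None


def build_domain_lexicon(words, manual_labels):
    """Label words by affix first, then fall back to manual labels.

    Returns the labelled lexicon and the words that still need a label.
    """
    lexicon, unresolved = {}, []
    for word in words:
        key = word.lower()
        tag = label_by_affix(key) or manual_labels.get(key)
        if tag is None:
            unresolved.append(word)
        else:
            lexicon[key] = tag
    return lexicon, unresolved


def merge_lexicons(generic, domain):
    """Add domain entries to a copy of the generic tagger's lexicon.

    Assumption: domain entries take precedence on conflict.
    """
    merged = dict(generic)
    merged.update(domain)
    return merged


if __name__ == "__main__":
    words = ["kecantikan", "perawatan", "secepatnya", "terbaik", "serum", "glowing"]
    manual = {"serum": "NOUN", "glowing": "ADJ"}  # illustrative manual labels
    domain_lexicon, todo = build_domain_lexicon(words, manual)
    print(domain_lexicon)
    print("still unlabelled:", todo)
```

Words that match no rule and have no manual label are returned separately, which mirrors the abstract's split between affix-labelled words and those that remain for manual annotation.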