Segmentation and tagging of text are important preprocessing steps for higher-level natural language processing tasks. In this thesis, we apply a sequence labelling framework based on neural networks to various segmentation and tagging tasks, including sentence segmentation, word segmentation, morpheme segmentation, joint word segmentation and part-of-speech tagging, and named entity transliteration. We apply a general neural CRF model to different tasks by designing specific tag sets. In addition, we explore effective ways of representing input characters, such as utilising concatenated n-grams and sub-character features, and use ensemble decoding to mitigate the effects of random parameter initialisation. The segmentation and tagging models are evaluated in a truly multilingual setup with more than 70 datasets. The experimental results indicate that the proposed neural CRF model is effective for segmentation and tagging in general, as state-of-the-art accuracies are achieved on datasets in different languages, genres, and annotation schemes for various tasks. For word segmentation, we propose several typological factors to statistically characterise the difficulties posed by different languages and writing systems. Based on this analysis, we apply language-specific settings to the segmentation system for higher accuracy. Our system achieves substantially better results than previous work on languages that are more difficult to segment. Moreover, we investigate conventionally adopted evaluation metrics for segmentation tasks. We propose that precision should be excluded and that recall alone is more adequate for sentence segmentation and word segmentation. The segmentation and tagging tools implemented along with this thesis are publicly available, both as experimental frameworks for future development and as preprocessing tools for higher-level NLP tasks.
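The abstract's framing of segmentation as character-level sequence labelling can be illustrated with a small sketch. This is not the thesis code: it only shows the common BIES tag-set encoding (Begin/Inside/End/Single) that lets a tagger such as a neural CRF treat word segmentation as per-character tagging; the function names are my own.

```python
# Illustrative sketch (not the thesis implementation): word segmentation
# cast as character-level sequence labelling with a BIES tag set.

def words_to_tags(words):
    """Map a list of words to one BIES tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first char is B, last is E, any middle chars are I
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Recover words from characters and their predicted BIES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # tag closes the current word
            words.append(current)
            current = ""
    if current:  # tolerate a malformed tag sequence at the end
        words.append(current)
    return words

words = ["中国", "人"]
tags = words_to_tags(words)                  # ['B', 'E', 'S']
assert tags_to_words("中国人", tags) == words
```

With this encoding, a trained tagger only has to predict one of four labels per character; joint word segmentation and POS tagging, as described above, is typically handled by enlarging the tag set (e.g. `B-NOUN`, `E-NOUN`).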
2018 (English). Doctoral thesis, comprehensive summary (Other academic).
Language Technology (Computational Linguistics). Research subject: Computational Linguistics.
Identifiers: URN: urn:nbn:se:uu:diva-268921; OAI: oai::uu-268921; DiVA id: diva2:881662.

2015 (English). In: Proceedings of the Fifth Named Entity Workshop, Association for Computational Linguistics, 2015, p. 56-60. Conference paper, Published paper (Refereed).
Conference: Fifth Named Entity Workshop, joint with the 53rd ACL and the 7th IJCNLP, July 31 2015, Beijing, China.
Place, publisher, year, edition, pages: Association for Computational Linguistics, 2015.

Abstract: This paper presents our machine transliteration systems developed for the NEWS 2015 machine transliteration shared task. Our systems are applied to two tasks: English to Chinese and Chinese to English. For standard runs, in which only official data sets are used, we build phrase-based transliteration models with refined alignments provided by the M2M-aligner. For non-standard runs, we add multilingual resources to the systems designed for the standard runs and build different language-specific transliteration systems. Linear regression is adopted to rerank the outputs afterwards, which significantly improves the overall transliteration performance.
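The reranking step mentioned in the abstract can be sketched as follows. This is a hedged illustration, not the paper's system: the feature names and weights are hypothetical, and in the paper the weights would be learned by linear regression rather than fixed by hand; only the idea of rescoring an n-best candidate list with a linear model is taken from the text.

```python
# Illustrative sketch: reranking n-best transliteration candidates with a
# linear model. Features and weights below are hypothetical examples, not
# the paper's actual feature set (which is fit by linear regression).

def rerank(candidates, weights):
    """candidates: list of (output_string, feature_vector) pairs.
    Returns the list sorted by linear-model score, best first."""
    def score(features):
        return sum(w * f for w, f in zip(weights, features))
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)

# Hypothetical features: [model probability, length ratio, negated n-best rank]
nbest = [
    ("贝拉克", [0.42, 1.0, -1.0]),
    ("巴拉克", [0.40, 1.0, -2.0]),
]
ranked = rerank(nbest, weights=[1.0, 0.2, 0.05])
assert ranked[0][0] == "贝拉克"
```

In practice one would fit the weight vector on held-out data (e.g. with least squares against a candidate-quality target), which is what makes the reranker improve over the base transliteration model's own ordering.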