Exploring Morpheme Length and Classification Accuracy in Dravidian Languages: An Unsupervised Learning Approach

Thivaharan.S, Srivatsun.G

Abstract

Agglutination in South Indian languages such as Tamil, Telugu and Malayalam enlarges the lexicon and vocabulary, producing a large volume of viable morphemes. Although agglutination is regarded as a prominent feature that enriches a language, it also lowers the accuracy of feature classification during morpheme handling and analysis, which in turn leads to inappropriate mappings across languages of a similar nature. Agglutination, together with regional dialects, poses an open challenge to attaining high accuracy. This paper proposes an unsupervised learning model, based on a polynomial regression framework, for morphological segmentation, and studies how morpheme length affects classification accuracy. The model is based on unigram word segmentation, under the assumption that morph length in the investigated data is evenly distributed. Two morphotactically related languages, informal Tamil and German (Deutsch), were considered. The experimental results are benchmarked against a statistical morphological toolkit. They indicate that morpheme length has a definite impact on the analysis and on improving prediction accuracy, which reaches close to 87%.
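
To make the idea of unigram word segmentation concrete, the following is a minimal sketch and not the authors' model. It assumes a hypothetical toy lexicon (`MORPH_PROBS`) with made-up morph probabilities, and it picks the split whose morphs have the highest product of probabilities using dynamic programming. The function name `segment` and the `max_len` cap on morph length are illustrative assumptions.

```python
import math

# Hypothetical toy lexicon of morph probabilities (illustrative values only,
# not taken from the paper's data).
MORPH_PROBS = {
    "un": 0.10,
    "freund": 0.05,
    "lich": 0.08,
    "freundlich": 0.001,
    "unfreund": 0.0005,
}


def segment(word, probs, max_len=10):
    """Return the most probable unigram segmentation of `word`, or None."""
    n = len(word)
    # best[i] = (log-probability, morph list) of the best split of word[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        # Only consider candidate morphs of at most max_len characters.
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n][1] if best[n][0] > -math.inf else None


print(segment("unfreundlich", MORPH_PROBS))  # ['un', 'freund', 'lich']
```

In this sketch the cap `max_len` is the only place where morph length enters. Varying it, or weighting candidates by length, would be one way to probe how length assumptions change the chosen segmentation.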
