Dialect Clustering with Character-Based Metrics: in Search of the Boundary of Language and Dialect.

Yo Sato, Kevin Heffernan

2020

We present a universal, character-based method for representing sentences that allows the distance between any two sentences to be calculated. With a small alphabet, the representation can function as a proxy for phonemes. As one of its main uses, we carry out dialect clustering: we cluster a corpus of mixed dialects or sub-languages into sub-groups and check whether those sub-groups coincide with the conventional boundaries of dialects and sub-languages. Using data from multiple Japanese dialects and multiple Slavic languages, we report how well each group clusters, as a partial answer to the question of what separates languages from dialects.
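As a rough illustration of the pipeline the abstract outlines (represent each sentence by its characters, measure pairwise distance, then cluster), the following is a minimal sketch only. The abstract does not specify the representation, the metric, or the clustering algorithm, so character bigram counts, cosine distance, and average-linkage agglomerative clustering are all assumptions made here, and the function names `char_ngrams`, `cosine_distance`, and `cluster` are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(sentence, n=2):
    """Count overlapping character n-grams (assumed representation)."""
    padded = " " + sentence.lower() + " "  # pad so edge characters form n-grams
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_distance(a, b):
    """1 - cosine similarity between two n-gram count vectors (assumed metric)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def cluster(sentences, k, n=2):
    """Naive average-linkage agglomerative clustering down to k groups.

    Returns a list of clusters, each a list of indices into `sentences`.
    """
    vecs = [char_ngrams(s, n) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(cosine_distance(vecs[i], vecs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy strings standing in for a mixed corpus (not real dialect data).
toy = ["aaaa aaaa", "aaaa aaa", "zzzz zzzz", "zzz zzzz"]
groups = cluster(toy, k=2)
```

In a real experiment one would compare the resulting groups against the known dialect or language labels of each sentence; this quadratic-time loop is only meant for small toy inputs.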

Keywords:

Computer science
Speech recognition
Cluster analysis
Artificial intelligence
Natural language processing
