LLM-based semantic maps
How contextual word embeddings can be used to explicate and visualize word-senses, homonyms, and their shifts over time.
Since the introduction of modern distributional word embedding methods like word2vec (2013), many studies have used word embeddings for semantic analysis. But that sort of analysis is severely hampered by word2vec’s inability to distinguish homonyms and word-senses that share a spelling (e.g. “duck” the bird vs. “duck” the action). Such studies typically end up shedding light on the usage-frequency ratio of the different homonyms or senses rather than on the word-senses themselves. This is especially problematic for diachronic analyses, where the vector shift caused by a change in the usage ratio (especially one resulting from a new word-sense) tends to wash out any individual sense’s shift.
A few recent papers (notably Giulianelli et al. 2020 and Qiu and Xu 2022) have demonstrated that contextual word embeddings effectively solve this problem and enable more detailed synchronic and diachronic analysis. In this project, I expand on Giulianelli et al.’s work to produce more detailed and intuitive LLM-based semantic analysis. Specifically, I:
- benchmark the word-sense disambiguation ability of embeddings from more powerful model architectures like DeBERTa and OPT, together with clustering methods beyond k-means (aiming for new state-of-the-art correlation with human judgments);
- develop an intuitive plotting and visualization method useful to linguists, lexicographers, and hobbyists who lack the technical skills to query large language models or to extract and manipulate the resulting vectors;
- include a wide range of examples for examination and comparison.
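As a rough sketch of the core pipeline behind the first two items, the snippet below clusters per-occurrence embeddings into candidate senses and projects them to 2D for plotting. The vectors here are synthetic stand-ins (in practice each would be extracted from a contextual model such as BERT, DeBERTa, or OPT, one embedding per occurrence of the target word), and the choice of k = 2 and plain k-means are simplifying assumptions, not the project’s final method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for contextual embeddings of "duck": one cloud
# per sense (bird vs. action). Real vectors would come from a model's
# hidden states, one per occurrence in the corpus.
bird = rng.normal(loc=0.0, scale=0.1, size=(20, 8))
action = rng.normal(loc=1.0, scale=0.1, size=(20, 8))
X = np.vstack([bird, action])

def kmeans(X, k=2, iters=25):
    """Minimal k-means; stands in for the sense-clustering step."""
    # Deterministic far-apart initialization for this illustration.
    centers = X[[0, -1]][:k]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) for j in range(k)])
    return labels

labels = kmeans(X)  # cluster id per occurrence = candidate word-sense

# Project to 2D with PCA (via SVD) so each occurrence can be plotted
# as a point colored by its cluster, giving an at-a-glance sense map.
Xc = X - X.mean(0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ vt[:2].T  # shape (40, 2): x/y positions for plotting
```

On real data the same two arrays — cluster labels and 2D coordinates — are all a plotting front end needs, which is what makes the visualization accessible to users who never touch the model itself.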
Work on this project is partially complete (paused in spring 2023 for the sake of other work), but I hope to return to it in fall 2023.