Historical-simulation LLM training
Human speakers gain a deeper understanding of a language by encountering earlier forms of it, but most language models pretrain only on internet-age texts. If we trained a language model in a historically reconstructed fashion, starting with Renaissance texts and moving forward epoch by epoch, how would it perform differently on modern texts?
Out of convenience, modern LLMs are trained almost exclusively on text from the past five decades or so, with a heavy bias toward the 21st century*. But historical forms of a language shape its speakers’ usage and comprehension too. In the weakest sense, the average living speaker first learned the language as it was thirty or forty years ago, and learned it partly from speakers who had themselves learned it twenty to forty years before that. Most speakers also read older forms of the language, albeit to varying extents, for education and for pleasure. In the strongest sense, older forms of the language can be understood as the causal root of, and explanatory justification for, many elements of the modern language, as well as the necessary context for understanding connotation and affect.
So it’s very plausible that training on older texts could deepen a language model’s grasp of syntax, grammar, and connotation. But training on historical text indiscriminately commingled with modern text introduces obvious hazards for the practical tasks that language models are often used for (such a model might generate anachronistic text, or take comical archaisms at face value). One approach that could confer the benefits of a foundational understanding of historical language while avoiding such pitfalls would be to train the model on the oldest available texts for the first epoch (or first n epochs), then progress to increasingly modern text in each subsequent epoch. The only remotely relevant precedent I’m aware of is HistBERT (Qiu and Xu, 2022), a BERT variant that was additionally pretrained on a time-balanced corpus of twentieth-century American texts in order to study semantic change in word senses over time. But while HistBERT is used to evaluate texts from each decade separately, it is trained on all of them at once – rather than on each era progressively, as I am proposing here.
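To make the schedule concrete, here is a minimal sketch of what the time-progressive loop might look like. Everything in it (the `EraShard` container, the `train_one_epoch` stub, the year boundaries) is a hypothetical placeholder for illustration, not a reference to any existing library or corpus.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class EraShard:
    start_year: int       # earliest publication year represented in the shard
    end_year: int         # latest publication year represented in the shard
    documents: List[str]  # raw text documents from this era


def train_one_epoch(model, documents: Iterable[str]) -> None:
    """Stand-in for one real pretraining epoch (tokenise, batch, optimise)."""
    for _doc in documents:
        pass  # forward/backward passes would go here


def time_progressive_pretrain(model, shards: List[EraShard],
                              epochs_per_era: int = 1) -> None:
    """Pretrain on the oldest era first, then move forward era by era.

    Here each era simply replaces the previous one in the training mix; an
    alternative (not specified in the proposal above) would be to keep earlier
    eras in the mix as newer ones are added, to limit forgetting of older forms.
    """
    for shard in sorted(shards, key=lambda s: s.start_year):
        for _ in range(epochs_per_era):
            train_one_epoch(model, shard.documents)


# Illustrative schedule: Renaissance texts first, then 18th-19th century
# texts, then modern web-era text.
shards = [
    EraShard(1500, 1700, ["..."]),
    EraShard(1700, 1900, ["..."]),
    EraShard(1900, 2024, ["..."]),
]
time_progressive_pretrain(model=None, shards=shards, epochs_per_era=2)
```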
My proposed time-progressive training method would have plenty of hazards – most notably the differing data availability and selection biases of different eras (not only do we have less text from the early modern era, we have very different genres of text, and nothing as conversational as 21st-century Reddit threads). But it could lead to a stronger language model. In any case, it would be fascinating to see how a historical grounding would affect the performance of an LLM on downstream NLU tasks.
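One way to at least quantify that imbalance, and partially correct for it, would be temperature-based sampling of the kind used in multilingual pretraining: flatten the raw per-era token distribution so that scarce early eras are upsampled. This is a speculative sketch rather than part of the proposal, and the token counts below are invented purely for illustration.

```python
def era_sampling_weights(tokens_per_era: dict, temperature: float = 0.7) -> dict:
    """Flatten the raw per-era token distribution to upsample scarce eras.

    temperature=1.0 reproduces the raw proportions; values closer to 0 push
    the mix toward uniform across eras (maximally upsampling early text).
    """
    total = sum(tokens_per_era.values())
    scaled = {era: (n / total) ** temperature for era, n in tokens_per_era.items()}
    norm = sum(scaled.values())
    return {era: w / norm for era, w in scaled.items()}


# Invented, purely illustrative token counts (in billions of tokens).
print(era_sampling_weights({"1500-1700": 0.5, "1700-1900": 20.0, "1900-2024": 2000.0}))
```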
*See this table that I’ve compiled of the dataset categories on which various LMs are trained. The only one I’m aware of that contains a significant portion of historical text is the Books-general category (usually either Gutenberg or Books3/Bibliotik).