Oxford University Press's
Academic Insights for the Thinking World

Numbers and historical linguistics: a match made in heaven?

Whatever you associate with the term “historical linguistics,” chances are that it will not be numbers or computer algorithms. This would perhaps not be surprising were it not for the fact that linguistics in general has seen increasing use of exactly such quantitative methods. Historical linguistics tends to use statistical testing and quantitative arguments less than linguistics generally. But it doesn’t have to be like that.

Linguistics generally has seen an increase in the use of corpora and quantitative methods in recent years. Journal publications in historical linguistics, however, are less likely to use such methods. Part of the explanation is no doubt the advantage that research on extant languages enjoys: annotated text corpora are more readily available, and there are speakers who can answer questionnaires or take part in experiments. Yet this can only be part of the explanation.

First, although historical records are clearly patchy and biased, there is nevertheless much information that can be processed quantitatively. For instance, there is an increasing variety of computational language resources available for historical language varieties. Similarly, the computational and statistical tools for processing and using these resources are becoming increasingly open and easy to use.

Second, historical linguistics has a long tradition of quantitative methods: going far back in time, the field has been informed by statistics, counting, and quantitative measures. However, this approach has never been mainstream, in the sense that it is not the typical or most frequent way of doing research in the field.

This suggests a golden opportunity: why not use probabilities to estimate changes in language? Or use statistics to measure similarity between varieties? Or crunch numbers in order to describe the chances that a phenomenon did or did not occur in the past, given the available yet inevitably patchy evidence? After all, these are precisely the scenarios that quantitative techniques are ideal for.

In short, there is nothing inherent in the field of historical linguistics that suggests it should not make greater use of quantitative techniques. An interesting metaphor to describe this state of affairs is what is known as the technology adoption curve. The curve takes the form of a bell curve covering all potential users. On the far left, where the curve is thin, we find the early adopters, those who will pick up new technology either out of sheer curiosity or because they think it will give them an advantage. Moving right, where the peak of the bell curve sits, we find the majority of users, who will only adopt a new technology when it is convenient or when other choices are becoming inconvenient. Between these groups, the early adopters and the large majority, there is a metaphorical gap, or chasm. For any new technology, crossing this chasm is the key to success.

Looking at linguistics as a whole, it appears that quantitative methods have indeed crossed that chasm and gone mainstream. Research papers in leading journals are increasingly making use of statistical techniques to support their linguistic arguments. Rather than a paradigm shift, this can be viewed as quantitative techniques being adopted by the majority, to the extent that some may feel they are being pressured to use such techniques.

By contrast, historical linguistics seems to have resisted this trend to a greater degree. It is reasonable to look to cultural explanations for this. After all, the technical barriers keep getting lower and the availability of resources keeps increasing. So what is special about historical linguistics? For one thing, historical linguistics (at least if we consider the historical-comparative method) has a very long and very successful history, and the methodological core of that method has proved remarkably stable over time.

Furthermore, there is a history of failed attempts at using quantitative methods in historical linguistics. In some cases, such techniques have been tested and simply failed to work, as one would expect in any scientific endeavour. In other cases, the lack of extensive quantitative modelling by historical linguists has enticed scholars from other fields, with experience in statistical models, to step in and fill that gap. These endeavours have met with mixed reactions from mainstream historical linguistics.

What seems to be missing is a positive case for using quantitative methods in historical linguistics, on the premises of historical linguistics. That, in our view, is the only way that quantitative techniques can properly cross the chasm into adoption in mainstream historical linguistics. Such a positive case must go well beyond training manuals or statistics classes. Instead, the intellectual footwork for integrating numbers with the core questions that historical linguistics faces must be done.

By outlining a set of principles for integrating numbers and historical linguistics, we set out to provide the basis for such a positive case. These principles support a transparent, data-driven form of research that can build upon and complement (but does not replace) traditional historical linguistics research. Further work is surely needed to drive this positive change, including work on how theories of linguistic change can more directly incorporate statistical components. Yet the present does look like a promising time for historical linguistics to cross the quantitative chasm.

Featured image credit: Binary damage code by Markus Spiske. CC0 via Flickr.

Recent Comments

  1. Rudy Troike

    Perhaps your book acknowledges the pioneering work of Morris Swadesh in glottochronology, which was unfortunately widely repudiated, though his formulas for rate of change have been tested and found accurate in cases as widely varied as Turkic and Quechua.
