Apparent-time and spatial diffusion in large social-media corpora
Social media offers a historically unparalleled data source for sociolinguistics and dialectology. For research on language change, the difficulty with these datasets is identifying enough metadata to trace diffusion. Previous variationist work using Twitter data deals largely with geography, raising the concern that the geographic patterns it reports interact with non-geographic confounds. Here, we estimate user ages and consider the interaction between space and time.
Using a corpus of 104,657,500 tweets from 1,734,260 users in the UK and Ireland, we assign 25.6% of users an age or age category from mentions of birth year, age, family relationships and employment status. We consider the performance of these predicted ages over a set of variables in British English: loss of the preposition ‘to’ with ‘go’ and certain nouns (“go __(the) pub”); paradigmatic levelling to ‘was’ (“you was”); and de-levelling of “I/he/she were” to ‘was’. Apparent-time effects emerge that add to our understanding of each variable.
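As a rough illustration of how an age might be inferred from self-reported mentions of birth year or age, the following minimal Python sketch applies two regular expressions to a tweet's text. The patterns, the function name `estimate_birth_year` and the `tweet_year` parameter are all hypothetical and chosen for illustration; they are not the rules used in this study, which also draws on family relationships and employment status.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; not the rules used in the study.
# Matches first-person age statements such as "I'm 23 years old" or "I am 23 years old".
AGE_PATTERN = re.compile(r"\bI(?:'m| am) (\d{1,2}) years old\b", re.IGNORECASE)
# Matches first-person birth-year statements such as "I was born in 1990".
BIRTH_YEAR_PATTERN = re.compile(r"\bI was born in ((?:19|20)\d{2})\b", re.IGNORECASE)


def estimate_birth_year(text: str, tweet_year: int) -> Optional[int]:
    """Return an estimated birth year from one tweet, or None if no cue is found.

    An explicit birth year is preferred; otherwise a stated age is
    subtracted from the year the tweet was posted.
    """
    match = BIRTH_YEAR_PATTERN.search(text)
    if match:
        return int(match.group(1))
    match = AGE_PATTERN.search(text)
    if match:
        return tweet_year - int(match.group(1))
    return None


# Example usage with invented tweets
print(estimate_birth_year("I was born in 1990 in Leeds", 2019))
print(estimate_birth_year("I'm 23 years old today", 2019))
print(estimate_birth_year("going pub later", 2019))
```

A sketch like this only covers the simplest first-person cues. Subtracting a stated age from the posting year can be off by one depending on the birthday, and a realistic pipeline would need to aggregate evidence across all of a user's tweets.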