Mapping word frequencies on Twitter using R and Python

In this workshop we demonstrate how to map word frequencies on American Twitter in R and Python and discuss the relevance of this approach to the study of language variation and change.
In the first half of the workshop we go through the process of mapping word frequencies step by step. We first introduce a sample dataset, which consists of the relative frequencies of the top 10,000 words in a multi-billion-word corpus of geolocated American Twitter collected between 2013 and 2014, measured across the 3,076 counties of the contiguous United States (see Grieve et al. 2018). We then show how to load and map this dataset and how to conduct basic forms of global and local spatial analysis to help identify regional patterns in dialect maps. Although we provide parallel code in both Python and R, we focus primarily on implementation in R for this workshop.
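The workshop code itself is distributed to participants separately; as a rough illustration of what a global spatial analysis of county-level word frequencies involves, the Python sketch below computes global Moran's I from scratch on a toy grid of regions. The grid, the rook adjacency scheme, and the synthetic frequency values are all illustrative assumptions, not the workshop data.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I: a measure of spatial autocorrelation in
    `values` given a symmetric binary adjacency matrix `weights`.
    Values near +1 indicate strong regional clustering; values
    near the expectation -1/(n-1) indicate no spatial pattern."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size
    z = x - x.mean()                       # deviations from the mean
    num = n * (w * np.outer(z, z)).sum()   # cross-products of neighbours
    den = w.sum() * (z ** 2).sum()
    return num / den

# Toy "map": a 4x4 grid of regions with rook (edge-sharing) adjacency.
side = 4
n = side * side
w = np.zeros((n, n))
for i in range(side):
    for j in range(side):
        k = i * side + j
        if i + 1 < side:                   # neighbour to the south
            w[k, k + side] = w[k + side, k] = 1
        if j + 1 < side:                   # neighbour to the east
            w[k, k + 1] = w[k + 1, k] = 1

# A smooth west-to-east gradient in relative frequency: strongly
# positively autocorrelated, so Moran's I is well above zero.
freq = np.array([j for i in range(side) for j in range(side)], dtype=float)
print(round(morans_i(freq, w), 3))  # → 0.667
```

Local analogues of this statistic (e.g. local Moran's I or Getis-Ord Gi*) decompose the global score into one value per county, which is what makes it possible to highlight specific hot and cold spots on a dialect map.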
In the second half of the workshop we consider the wider implications for dialectology and sociolinguistics of the analysis of word frequencies, an increasingly common approach in computational sociolinguistics (Nguyen et al., 2016). One of the fundamental methodological tenets of our field is the principle of accountability (Labov 1972), which requires that we identify not only all tokens of the form under analysis in a corpus, but also all contexts where an equivalent variant form could have been used in its place. Although the analysis of word frequencies violates this principle, we argue that this approach offers new insights into regional lexical and grammatical variation that cannot be arrived at through the analysis of sociolinguistic alternation variables, which can be difficult to define above the levels of phonetics and phonology (Lavandera 1977).
All data and code will be shared with participants ahead of the session, making the workshop fully replicable. Although a laptop is not required, participants are encouraged to bring one with the data and code downloaded.