Controlling for Descent and Area in linguistic typology

2026-03-17

When conducting comparative studies of linguistic features, we are essentially observing patterns. In a previous post I mentioned one problem we need to deal with: how we represent the features we want to compare. For comparisons based on lexical items, our gold standard is transcription in the IPA. Related to this, we can observe that the patterns in a given language, whether in sound structures and related word forms or in word classes and grammatical structures, have developed over time. While transcriptions and grammatical descriptions (and corpora more generally) reflect a snapshot of a language at a given point in time, the development of these forms may follow relatively idiosyncratic paths.

Since languages (and features within languages) develop at different rates under different conditions, one challenge is identifying which changes are consistent with universal properties and which are specific to a particular context. Changes in a given language may be due to language-internal or family-internal structural properties, or to influence from other languages spoken nearby. The potential effect of language family and/or geographical area on language (and cultural) change is known as Galton’s problem, and it has long been a concern for linguists and others who research aspects of human culture and behavior. How do we disentangle genuinely independent developments from those brought about largely by the context in which they occur?

Controlling for descent and area

Because of this concern regarding possible autocorrelation, it is now relatively common to attempt to control for effects of descent (language family) and of geography (language area) in typological studies. Early typological studies did this mainly by sampling (see Greenberg 1963/1990, Dryer 1989, Di Garbo et al. 2023), in part due to lack of data. The potential for statistical autocorrelation is less of an issue with large datasets, but since most cultural/linguistic comparison has been (and in many cases continues to be) conducted with small sample sizes, various approaches have been suggested to deal with it. The most comprehensive discussion I have seen of the different approaches is in Guzmán Naranjo and Becker (2022). The general observation is that any linear relationship or correlation between linguistic variables may be confounded by the simple fact of descent from a common ancestor, or by the languages being spoken in the same geographical area. One relatively simple way to control for this is to include language family and geographical area as variables in a linear regression via a mixed model.

The rest of this post illustrates one such approach in Python. I walk through how to choose variables, how to read and format the data using pandas, and how to conduct both standard linear regressions and linear mixed-effects regressions using the statsmodels library, showing that this approach can replicate findings from current typological research.

Choosing variables and getting data

In order to observe the potential effect of language family and area on particular typological features, we need some features to test. Fortunately for us, a recent paper (Verkerk et al 2025) has investigated a large number of purported universals using data from Grambank, which contains information on grammatical structures in a large number of languages. The dataset they used for testing each universal is available on Github, and contains numbering for each universal as well as datasheets containing observed languages and codes that correspond to presence/absence of the relevant variables.

For the purpose of illustration, we will select the purported universal 0005KA (“If a language has dominant SOV order and the genitive follows the governing noun, then the adjective likewise follows the noun.”), which is an “implicational universal”, a subset of the statistical universals in linguistic typology. In the Verkerk et al study, this was found to be significant in a Bayesian analysis without family/area controls, but the effect was non-significant once language family and area were controlled for. The datasheet for this feature contains a list of Glottocodes for languages, followed by two columns coded for presence/absence based on the Grambank data: “GB133_GB065” (whether a language is SOV and the genitive follows the noun), and “GB193” (whether the adjective follows the noun).

The following Python code assumes you have downloaded the data from the linked Github repository and saved it as 0005KA-BT_data.txt; it uses pandas to read the file into a dataframe.

import pandas as pd # import pandas for tabular data

featfile = "0005KA-BT_data.txt" # tab-delimited file with relevant features
# read the tab-delimited file and name the columns
fdf = pd.read_csv(featfile, delimiter="\t", header=None, names=["glottocode", "GB133_GB065", "GB193"])
print(fdf.head()) # visualize the data in the terminal

# glottocode  GB133_GB065  GB193
# 0   abun1252            0      1
# 1   area1240            0      1
# 2   axam1237            0      1
# 3   cash1251            1      0
# 4   chuk1273            0      0
# ...

We then want to match up our dependent variable (SOV order + [noun > genitive]) with the independent variable ([noun > adj]) for each language in the dataset. In the Verkerk et al data the relevant tags have already been coded, which makes them easy to extract. If you are working with other variables, this might take a bit more work.
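
Before modeling, it can be useful to cross-tabulate the two binary variables to see their joint distribution. A minimal sketch using pandas’ crosstab; the five rows here are just the head of the dataframe shown above, standing in for the full dataset:

```python
import pandas as pd

# stand-in for the feature dataframe built above (first five rows only)
fdf = pd.DataFrame({
    "glottocode": ["abun1252", "area1240", "axam1237", "cash1251", "chuk1273"],
    "GB133_GB065": [0, 0, 0, 1, 0],
    "GB193": [1, 1, 1, 0, 0],
})

# contingency table of the dependent variable against the independent variable
table = pd.crosstab(fdf["GB133_GB065"], fdf["GB193"])
print(table)
```

On the full dataset, a table like this gives a quick first impression of how often SOV + [noun > gen] co-occurs with [noun > adj] before any statistics are run.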

The next step is to get family and area information for each language in the dataset. For this we can use the language classifications and macroarea categories from Glottolog, which are updated regularly and can be accessed programmatically using the pyglottolog library. This data is also available on Github for the dataset that we are working with, which makes it easier for us. The code below assumes you have downloaded this file into the same directory.

glottofile = "Glottolog_Languages.csv" # comma-separated file with glottocodes and other info
gdf = pd.read_csv(glottofile) # read in the data
print(gdf.head()) # visualize

# glottocode        name isocodes  ... longitude Family_ID               Family_name
# 0   3adt1234  3Ad-Tekles      NaN  ...       NaN  afro1255              Afro-Asiatic
# 1   aala1237      Aalawa      NaN  ...       NaN  aust1307              Austronesian
# 2   aant1238  Aantantara      NaN  ...       NaN  nucl1709  Nuclear Trans New Guinea
# 3   aari1239        Aari      aiw  ...   36.5721  sout2845              South Omotic
# 4   aari1240      Aariya      aay  ...       NaN  book1242               Bookkeeping
# ...

This is more data than we need, so the following code reduces our dataset to only what we are interested in: the relevant languages with the variables of interest and the columns with language family and macroarea information. Note that the data contains roughly 50 isolates (languages with no known relatives) which share NaN as their “Family_name” identifier. In the code below we give each isolate a unique identifier to avoid creating a spurious grouping of unrelated languages.

# reduce the Glottolog data to only grouping factors
gdf = gdf[['glottocode', 'macroarea', 'Family_name']]
# merge this with the feature data in a new dataframe
df = pd.merge(fdf, gdf, on='glottocode', how='inner')
# replace NaN values under 'Family_name' (isolates) with unique identifiers
df['Family_name'] = df['Family_name'].fillna(df['glottocode'])
print(df.head()) # visualize

# glottocode  GB133_GB065  GB193      macroarea          Family_name
# 0   abun1252            0      1      Papunesia             abun1252
# 1   area1240            0      1      Papunesia         Austronesian
# 2   axam1237            0      1      Papunesia         Austronesian
# 3   cash1251            1      0  South America         Pano-Tacanan
# 4   chuk1273            0      0        Eurasia  Chukotko-Kamchatkan
# ...
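
It is worth verifying that the merge and the isolate fix behaved as expected. A quick sanity check, shown here on a toy dataframe that mirrors the structure of the merged output above (the values are illustrative):

```python
import pandas as pd

# toy merged dataframe mirroring the structure above (values illustrative)
df = pd.DataFrame({
    "glottocode": ["abun1252", "area1240", "cash1251"],
    "GB133_GB065": [0, 0, 1],
    "GB193": [1, 1, 0],
    "macroarea": ["Papunesia", "Papunesia", "South America"],
    "Family_name": [None, "Austronesian", "Pano-Tacanan"],
})

# same isolate fix as above: NaN family names become the language's own glottocode
df["Family_name"] = df["Family_name"].fillna(df["glottocode"])

assert df["Family_name"].isna().sum() == 0                  # no NaN families remain
assert (df["Family_name"] == df["glottocode"]).sum() == 1   # one singleton "family"
```

On the real data, the second check should count the roughly 50 isolates, each now forming its own singleton grouping.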

Linear regression (OLS)

We can then set up a standard linear regression with our dependent and independent variables using the statsmodels Python library. The standard linear regression (ordinary least squares) is our base case, without controls. Here we see that the correlation between the dependent and independent variables is significant (p < 0.001).

import statsmodels.formula.api as smf # import statsmodels with the formula api

dv = 'GB133_GB065'  # dependent variable (SOV order + [noun > genitive])
iv = 'GB193'        # independent variable (noun > adj)
formula = f'{dv} ~ {iv}' # final formula

print("standard linear regression (OLS)")
## model without grouping
model = smf.ols(formula, df) # OLS model
result = model.fit().summary() # run the linear model and get the summary
print(result) # visualize

# standard linear regression (OLS)
#                             OLS Regression Results                            
# ==============================================================================
# Dep. Variable:            GB133_GB065   R-squared:                       0.028
# Model:                            OLS   Adj. R-squared:                  0.028
# Method:                 Least Squares   F-statistic:                     52.19
# Date:                Mon, 16 Mar 2026   Prob (F-statistic):           7.43e-13
# Time:                        08:26:43   Log-Likelihood:                -1126.2
# No. Observations:                1786   AIC:                             2256.
# Df Residuals:                    1784   BIC:                             2267.
# Df Model:                           1                                         
# Covariance Type:            nonrobust                                         
# ==============================================================================
#                  coef    std err          t      P>|t|      [0.025      0.975]
# ------------------------------------------------------------------------------
# Intercept      0.4011      0.017     23.713      0.000       0.368       0.434
# GB193         -0.1584      0.022     -7.224      0.000      -0.201      -0.115
# ==============================================================================
# Omnibus:                     1764.316   Durbin-Watson:                   2.034
# Prob(Omnibus):                  0.000   Jarque-Bera (JB):              302.212
# Skew:                           0.804   Prob(JB):                     2.37e-66
# Kurtosis:                       1.785   Cond. No.                         2.91
# ==============================================================================
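
With a binary predictor and a binary outcome, the OLS coefficients have a direct reading: the intercept is the proportion of GB133_GB065 = 1 languages among those with GB193 = 0, and the slope is the difference in proportions between the two GB193 groups. So in the summary above, 0.4011 is the proportion of SOV + [noun > gen] languages where the adjective does not follow the noun, and 0.4011 − 0.1584 ≈ 0.243 is the proportion where it does. A minimal sketch with made-up data (not the Grambank values):

```python
import pandas as pd

# toy binary data, chosen so the proportions are easy to see
df = pd.DataFrame({
    "GB193":       [0, 0, 0, 0, 1, 1, 1, 1],
    "GB133_GB065": [1, 1, 0, 0, 1, 0, 0, 0],
})

# group means = proportion of GB133_GB065 = 1 within each GB193 group
props = df.groupby("GB193")["GB133_GB065"].mean()
intercept = props[0]         # matches the OLS intercept: 0.50
slope = props[1] - props[0]  # matches the OLS slope: 0.25 - 0.50 = -0.25
print(intercept, slope)
```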

Linear mixed model with grouping

Because we want to account for grouped influences on the data, we then use the “linear mixed model” implementation. This takes a “groups” argument that specifies how observations cluster, so that the model can account for the effect of grouping on possible correlations. To control for family and area, we can take two approaches. The first is to include family or area individually as a grouping factor, as in the following code.

## models using each group separately as controls
groups = ['Family_name', 'macroarea'] # a list of grouping factors to use as random effects
for g in groups:
    print("grouping variable:", g)
    ## model a single group with a mixed linear model
    model = smf.mixedlm(formula, df,
                        groups=df[g],
                        )
    result = model.fit().summary() # run the model and get the summary
    print(result) # visualize

# grouping variable: Family_name
#           Mixed Linear Model Regression Results
# =========================================================
# Model:            MixedLM Dependent Variable: GB133_GB065
# No. Observations: 1786    Method:             REML       
# No. Groups:       232     Scale:              0.0917     
# Min. group size:  1       Log-Likelihood:     -576.9150  
# Max. group size:  391     Converged:          Yes        
# Mean group size:  7.7                                    
# ----------------------------------------------------------
#            Coef.   Std.Err.    z     P>|z|  [0.025  0.975]
# ----------------------------------------------------------
# Intercept   0.442     0.031  14.437  0.000   0.382   0.502
# GB193      -0.026     0.018  -1.454  0.146  -0.062   0.009
# Group Var   0.139     0.065                               
# =========================================================
#
# grouping variable: macroarea
#           Mixed Linear Model Regression Results
# =========================================================
# Model:            MixedLM Dependent Variable: GB133_GB065
# No. Observations: 1786    Method:             REML       
# No. Groups:       6       Scale:              0.1813     
# Min. group size:  55      Log-Likelihood:     -1023.9012
# Max. group size:  511     Converged:          Yes        
# Mean group size:  297.7                                  
# ----------------------------------------------------------
#            Coef.   Std.Err.    z     P>|z|  [0.025  0.975]
# ----------------------------------------------------------
# Intercept   0.344     0.074   4.625  0.000   0.198   0.490
# GB193      -0.063     0.022  -2.890  0.004  -0.105  -0.020
# Group Var   0.031     0.048                               
# =========================================================

Here we see that when we control for language family, the correlation is not significant (p = 0.146), but when controlling only for macroarea the correlation is still significant (p = 0.004).

The second approach is to treat all languages/observations as a single group, and include language family and macroarea as random effects in the linear regression.

print("random effects: Language Family + Area")
## mixed linear model with multiple random effects
vc_form = {g: f"0 + C({g})" for g in groups} # variance-component formula for each grouping factor
df['group'] = 1 # one dummy group covering all observations, so family and area enter only as random effects
model = smf.mixedlm(formula, df,
                    groups=df['group'],
                    vc_formula=vc_form,
                    )
result = model.fit().summary() # run the linear model and get the summary
print(result) # visualize

# random effects: Language Family + Area
#           Mixed Linear Model Regression Results
# ==========================================================
# Model:             MixedLM Dependent Variable: GB133_GB065
# No. Observations:  1786    Method:             REML       
# No. Groups:        1       Scale:              0.0917     
# Min. group size:   1786    Log-Likelihood:     -568.6317  
# Max. group size:   1786    Converged:          Yes        
# Mean group size:   1786.0                                 
# ----------------------------------------------------------
#                 Coef.  Std.Err.   z    P>|z| [0.025 0.975]
# ----------------------------------------------------------
# Intercept        0.402    0.076  5.307 0.000  0.254  0.551
# GB193           -0.029    0.018 -1.610 0.107 -0.064  0.006
# Family_name Var  0.117    0.056                           
# macroarea Var    0.029    0.075                           
# ==========================================================

Once family and area are included together as controls, the independent variable (noun > adj) is not significantly correlated (p = 0.107) with the dependent variable (SOV + [noun > gen]). This suggests that such developments are not independent of descent or language contact, and thus do not reflect a universal property of language.
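
Rather than reading p-values off the printed summaries, they can also be pulled from the fitted results programmatically, which is handy when looping over many universals. A sketch using the OLS case (the mixed-model results expose `pvalues` the same way); the data here is randomly generated, purely for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical binary data standing in for the Grambank features
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "GB193": rng.integers(0, 2, 200),
    "GB133_GB065": rng.integers(0, 2, 200),
})

result = smf.ols("GB133_GB065 ~ GB193", df).fit()
pval = result.pvalues["GB193"]  # p-value for the independent variable
print(f"GB193 p-value: {pval:.3f}")
```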

Some thoughts

The goal of this post has been to highlight some of the concerns relevant to linguistic typology when it comes to understanding how languages change and evolve over time, and to provide some Python code that implements controls for language family and area with a simple example. Galton’s problem illustrates a more general property of linguistic data, related to non-independence and collinearity/multicollinearity: multiple features may all be correlated with the same underlying variables, making it difficult to disentangle the primary cause or driver of change.

I should note that we are making various assumptions about the data here, the primary one being that the model residuals are approximately normally distributed. This kind of analysis is supported in part because we have a large number of observations (1,786); with fewer observations, given the large number of covariates, it is possible that our models wouldn’t converge. We can also observe that this is essentially categorical data, and if we were working with continuous variables (gradients or proportions of some kind) we would likely want to scale the values to avoid certain statistical pitfalls. Individual problems might require a slightly different approach, and there is some debate about which measures or grouping criteria should be used in linguistic typology.
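
For the continuous case, standardizing (z-scoring) a variable is one common way of scaling before regression. A minimal sketch in pandas, with a made-up continuous variable:

```python
import pandas as pd

# hypothetical continuous variable, e.g. a proportion measured per language
x = pd.Series([0.10, 0.25, 0.40, 0.55, 0.70])

# z-score: subtract the mean and divide by the standard deviation
z = (x - x.mean()) / x.std()
print(z.round(2).tolist())  # [-1.26, -0.63, 0.0, 0.63, 1.26]
```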

But hopefully this exercise has at least illustrated that implementing controls for possible effects from language family or area is not a difficult task. And by using such techniques, particularly with freely available data, we can verify findings and get closer to understanding the mechanisms of language change that drive the typological similarities and differences we find in the world’s languages.