Comparing automated Implicit Motive models

2026-04-08

Comparing Implicit Motive Models

Part of my research involves using machine learning and natural language processing to automate content coding. Social scientists have long recognized that humans make inferences about the world based on information from a variety of sources. Much of what we infer about other people is based on what they say or how they say it. Whole subfields (e.g. sociolinguistics) are devoted to studying how what we say reflects larger constructs (such as social categories like class).

Content coding and scale development

Content coding is one way of measuring variables in text that can be related to theoretical constructs, and many such scales have been developed over the years for various purposes. I would note that while it is relatively easy to develop a scale (see some recent work on tools for scale development here), it is much more difficult to develop a reliable scale that consistently measures a construct of interest and can be applied systematically.

In psychology, there has been a good deal of work on developing scales for measuring personality constructs. Most of these scales are based on self-report questionnaires, and while they can be effective (Big Five, MBTI) and are widely used by practitioners, they are subject to the general problems of self-reported data (e.g. social desirability, recall, and reference biases). In contrast, content coding can offer a more valid measure, since participants are unaware of how what they write or say could affect the outcome.

In light of this, some early work on personality used content coding to derive personality profiles for individuals. This began primarily with the Thematic Apperception Test (TAT), which asks participants to generate stories based on prompts or images. The TAT was later developed more systematically for personality assessment as the Picture Story Exercise (PSE), which is a well-validated means of measuring personality.

Implicit motives and automation

The PSE is a standardized set of pictures and related prompts that ask respondents to write stories about the pictures they are presented with. Each story is then coded according to the theory of Implicit Motives, following the Winter (1994) manual, to give a motive score in three categories: Achievement, Affiliation, and Power. The motive profile of an individual is the relative total amount of Achievement, Affiliation, and Power imagery in their responses (often corrected for word count), and it correlates with a host of behavioural outcomes.
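As a toy illustration of the word-count correction (the counts below are hypothetical, not real PSE data), a corrected profile can be expressed as motive images per 1,000 words:

```python
# Sketch of a word-count-corrected motive profile (hypothetical numbers).
# Raw imagery counts are commonly normalized per 1,000 words of story text.
def motive_density(counts: dict, word_count: int) -> dict:
    """Convert raw motive imagery counts into images per 1,000 words."""
    return {motive: c / word_count * 1000 for motive, c in counts.items()}

# e.g. 4 Achievement, 7 Affiliation, and 5 Power images across 800 words
profile = motive_density({"Ach": 4, "Aff": 7, "Pow": 5}, word_count=800)
print(profile)  # {'Ach': 5.0, 'Aff': 8.75, 'Pow': 6.25}
```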

Content coding of variables is an extremely time-consuming process and requires training, which becomes more burdensome with the complexity of the coding system. In the case of TATs and implicit motive coding, automation attempts have been made since the 1960s, but most approaches relied on dictionaries. Since imagery does not depend solely on the meaning of individual words, such approaches largely failed to achieve the kind of reliable correlations that would allow them to replace manual human annotation.
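To see why, consider a minimal dictionary-style coder (the keyword list below is purely illustrative, not any published word list): it can only match surface forms, so imagery expressed through context or unlisted vocabulary is missed.

```python
# Toy dictionary-based coder: flags Power imagery if any keyword appears.
# The keyword set is hypothetical, chosen only for illustration.
POWER_WORDS = {"control", "influence", "attack", "persuade"}

def dictionary_code_power(sentence: str) -> int:
    """Return 1 if the sentence contains any 'power' keyword, else 0."""
    tokens = {w.strip(".,!?;:").lower() for w in sentence.split()}
    return int(bool(tokens & POWER_WORDS))

print(dictionary_code_power("He tried to influence the committee."))     # 1
print(dictionary_code_power("She quietly dominated the conversation."))  # 0 (imagery missed)
```

The second sentence clearly contains Power imagery, but the coder misses it because "dominated" is not in the list; expanding the list endlessly still cannot capture imagery that depends on context rather than vocabulary.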

However, in the last few years, with the advent of more capable language models and related tools, automation of implicit motive coding seems to be within reach. As a case in point, there are now three (!!) models that can code text for implicit motives. There are some necessary caveats regarding the interpretation of implicit motive codes for non-PSE texts, which I will address at the end of this post. Most of the remainder of this post introduces the three models and walks through Python code that illustrates how they can be used to code for motives. We use the Winter (1994) benchmark dataset to compare their output and assess how well they match human-coded scores.

Implicit Motive coding models

There are three recent models that apply machine learning and NLP techniques to classify text for implicit motives, using more recent transformer architectures and improving on efforts by Pang & Ring (2020). Brede et al. trained a model using the setfit library, in which sentence embeddings serve as features for a classifier trained on PSE data. Nilsson et al. take a similar approach, fine-tuning a RoBERTa model on PSE data to classify text for individual motives. Pang et al. use ELECTRA, trained on a combination of PSE and other data, to classify text for all three motives.

Each of the models is available for use by researchers, and all can be loaded with the transformers library, hosted on Hugging Face (a community platform for AI models). Nilsson et al.'s model is available via the theharmonylab organization and can also be found as part of the text library for R. Brede et al.'s model is part of the automatedMotiveCoder organization, and Pang et al.'s ELECTRA model is available from encodingai. The remainder of this post illustrates how the models can be used easily via Python. While I program in R as well as Python, Python is my primary programming language, so for ease of illustration it's what we'll be using below. Additionally, we'll be accessing the benchmark dataset from Winter (1994), as recommended by the Pang & Ring (2020) paper and used for comparison in the Pang et al. paper.

Loading the models in Python

The first thing we need to do is set up an environment for running the code, and to download the models themselves, which may require a (free) Huggingface account. I’ll assume that you have a virtual environment running Python 3.9 or higher (I’ve tested this code on both Python 3.9 & 3.10), and that you’re somewhat familiar with running terminal commands and Python scripts. As you follow along with this post, you should be able to copy/paste each code block into the same Python script sequentially, in order to run it in the terminal.

To make sure things are set up properly, with the virtual environment active, ensure that you have all the necessary libraries installed with the following command in the terminal.

pip install "transformers<5.0" setfit pandas openpyxl

The double quotes around the transformers requirement prevent the shell from interpreting the < character, and the version constraint ensures we get a version of the library that is compatible with the setfit library (used for the Brede et al. model). The pandas library handles dataframes, and openpyxl lets us work with Excel spreadsheets. Downloading the models may also require a (free) account on huggingface.co. Once you have the necessary libraries, we can get the actual models, which for all three can be done with the following code.

from transformers import pipeline # load transformers pipeline
# load the electra model (~520mb on first download)
model1 = "encodingai/electra-base-discriminator-im-multilabel-V3"
# instantiate the classifier with the text-classification pipeline
classifier1 = pipeline("text-classification", model=model1)
name1 = "electra" # store the name for print statements

from setfit import SetFitModel # load the setfit pipeline
# load the setfit model (~2.25gb on first download)
model2 = "automatedMotiveCoder/setfit"
# instantiate the classifier with the setfit pipeline
classifier2 = SetFitModel.from_pretrained(model2)
name2 = "amc-setfit" # store the name for print statements

# load the RoBERTa models (~1.42gb for each model, ~4.26gb total on first download)
# and instantiate each classifier with the text-classification pipeline
model3 = "theharmonylab/implicit-motives-achievement-roberta-large"
classifier3ach = pipeline("text-classification", model=model3)
model3 = "theharmonylab/implicit-motives-affiliation-roberta-large"
classifier3aff = pipeline("text-classification", model=model3)
model3 = "theharmonylab/implicit-motives-power-roberta-large"
classifier3pow = pipeline("text-classification", model=model3)
name3 = "roberta" # store the name for print statements


Note that the ELECTRA and RoBERTa models both use the transformers pipeline for text classification, while the AutomatedMotiveCoder (AMC) model is loaded through the setfit interface. There is also a difference in size and format: ELECTRA is a single 0.5gb model for multilabel classification, AMC is a single 2.25gb model comprising four "one-vs-rest" classifiers, and the RoBERTa model is actually 3 separate models (one for each motive) of 1.42gb each (4.26gb total).

Running this code downloads the models; once downloaded, they are cached locally and reused for future inference. Now that we have the models, let's test them on a single sentence to make sure everything is working properly. The sentence we will use is taken from the Winter (1994) training manual and is scored for both Affiliation and Power.

# this is a sentence from the Winter manual that is double-scored for Power and Affiliation
sentence = """The recollection of skating on the Charles, and the time she had
            pushed me through the ice, brought a laugh to the conversation; but
            it quickly faded in the murky waters of the river that could no
            longer freeze over."""
# predict on the test sentence using the electra model and return probabilities
result = classifier1(sentence, top_k=3) # we want all three labels, not just the most likely
scores1 = {x['label']: x['score'] for x in result}
print(f'Probabilities for each label (test sentence, {name1}): {scores1}')
rounded1 = {k: int(round(v)) for k, v in scores1.items()} # round the scores
print(f'Rounded scores for each label (test sentence, {name1}): {rounded1}')

# Probabilities for each label (test sentence, electra): {'Aff': 0.9999632835388184, 'Pow': 0.9999274015426636, 'Ach': 2.489300641173031e-06}
# Rounded scores for each label (test sentence, electra): {'Aff': 1, 'Pow': 1, 'Ach': 0}

# predict on the test sentence using the setfit model and return probabilities
# the setfit model returns a score for each of the three motives and 'null'
# since we only want Ach, Aff, Pow, we only get the first three indices
result = classifier2.predict_proba(sentence).numpy() # ach, aff, pow, null
scores2 = {'Ach': float(result[0]), 'Aff': float(result[1]), 'Pow': float(result[2])}
print(f'Probabilities for each label (test sentence, {name2}): {scores2}')
rounded2 = {k: int(round(v)) for k, v in scores2.items()} # round the scores
print(f'Rounded scores for each label (test sentence, {name2}): {rounded2}')

# Probabilities for each label (test sentence, amc-setfit): {'Ach': 0.018920687518636643, 'Aff': 0.04094580107598384, 'Pow': 0.6419628438800001}
# Rounded scores for each label (test sentence, amc-setfit): {'Ach': 0, 'Aff': 0, 'Pow': 1}

# predict on the test sentence using the roberta model and return probabilities
# the roberta models return two probabilities - 'LABEL_0' is the probability of no classification
# and 'LABEL_1' is the probability of a classification
# since we only want the likelihood of classification, we get just the score for 'LABEL_1'
resultach = [{x['label']: x['score'] for x in classifier3ach(sentence, top_k=2)}['LABEL_1']] # achievement
resultaff = [{x['label']: x['score'] for x in classifier3aff(sentence, top_k=2)}['LABEL_1']] # affiliation
resultpow = [{x['label']: x['score'] for x in classifier3pow(sentence, top_k=2)}['LABEL_1']] # power
result = [dict(zip(['Ach', 'Aff', 'Pow'], item)) for item in list(zip(resultach, resultaff, resultpow))]
scores3 = result[0]
print(f'Probabilities for each label (test sentence, {name3}): {scores3}')
rounded3 = {k: int(round(v)) for k, v in scores3.items()} # round the scores
print(f'Rounded scores for each label (test sentence, {name3}): {rounded3}')

# Probabilities for each label (test sentence, roberta): {'Ach': 0.0005982170696370304, 'Aff': 0.47605016827583313, 'Pow': 0.38252514600753784}
# Rounded scores for each label (test sentence, roberta): {'Ach': 0, 'Aff': 0, 'Pow': 0}


If you have loaded everything correctly, you should see that all three models are able to score the sentence, with some differences in classification. While the ELECTRA and AMC models score the sentence for Power, only ELECTRA double scores the sentence for Affiliation as well. We can also observe that the RoBERTa model differs in its probability scores from the other models, approaching a correct double-classification but ultimately not hitting the 0.5 probability that would round up to 1.
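Note that rounding a probability implicitly applies a 0.5 decision threshold. If you wanted to experiment with a different cutoff, you could binarize the probabilities yourself; the scores below mirror the RoBERTa output above, truncated for readability.

```python
# Rounding a probability is equivalent to a 0.5 threshold; a custom
# threshold can be substituted (scores from the RoBERTa example, truncated).
scores = {"Ach": 0.0006, "Aff": 0.4761, "Pow": 0.3825}

def binarize(scores: dict, threshold: float = 0.5) -> dict:
    """Turn label probabilities into 0/1 codes at a given threshold."""
    return {label: int(p >= threshold) for label, p in scores.items()}

print(binarize(scores))                  # {'Ach': 0, 'Aff': 0, 'Pow': 0}
print(binarize(scores, threshold=0.35))  # {'Ach': 0, 'Aff': 1, 'Pow': 1}
```

Whether a lower threshold is actually justified would need to be validated against human codes, so treat this purely as a diagnostic.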

Getting the Winter 1994 benchmark dataset and testing models

A single sentence is hardly a fair comparison, however, and test datasets can certainly affect the measured performance of a model, particularly depending on whether the test data has a similar distribution to the training data. To compare the models on a larger dataset we can make use of the benchmark dataset suggested by Pang & Ring (2020), which is available via this OSF link.

The data is from Winter (1994), the manual used to train implicit motive researchers in the coding system for identifying Achievement, Affiliation, and Power imagery in text. Pang & Ring split the training materials from the manual into 1,358 sentences and checked/rescored the data, ignoring the "second sentence rule" (see the paper for more details). Human coders are expected to achieve over 0.85 agreement with this dataset in order to be considered trained in the coding system and able to score other texts; we would expect an automated model to achieve similar agreement before using it for research.
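As a simple sketch of what agreement means here (with made-up labels, not the actual benchmark data), percentage agreement is just the share of sentences on which two sets of codes match:

```python
# Sketch of simple percentage agreement between two coders' binary labels.
# The label lists below are made up for illustration.
def percent_agreement(coder_a: list, coder_b: list) -> float:
    """Fraction of items on which the two coders assign the same label."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

human = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
model = [1, 0, 0, 1, 0, 0, 0, 1, 0, 1]
print(percent_agreement(human, model))  # 0.8
```

In practice, researchers often also report chance-corrected statistics, but raw agreement is the threshold described in the coding manual.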

The following code assumes that you have downloaded the Excel spreadsheet and renamed it Winter1994_sentences.xlsx. Here we read the data into a dataframe, get the list of sentences, and classify the sentences using each model that we previously loaded. We then store each set of predictions in its own dataframe and check correlations with the original dataset.

import pandas as pd # import pandas to work with excel data
# Winter benchmark dataset: https://osf.io/6fnz5
dataset = "Winter1994_sentences.xlsx" # downloaded and renamed dataset

df = pd.read_excel(dataset) # read the data
texts = df['Text'].tolist() # get the sentences as a list

# predict probabilities for each sentence using electra model
result = classifier1(texts, top_k=3)
predictions = [] # store rounded predictions in this list
for res in result:
    predictions.append({x['label']: int(round(x['score'])) for x in res}) # rounded
edf = pd.DataFrame.from_records(predictions) # make a new dataframe to store the results
# print(edf.head())

# predict probabilities for each sentence using setfit model
result = classifier2.predict_proba(texts).numpy() # ach, aff, pow, null
predictions = [] # store rounded predictions in this list
for res in result:
    predictions.append({'Ach': int(round(res[0])), 'Aff': int(round(res[1])), 'Pow': int(round(res[2]))}) # rounded
setdf = pd.DataFrame.from_records(predictions) # make a new dataframe to store the results
# print(setdf.head())

# predict probabilities for each sentence using the RoBERTa model
resultach = [{x['label']: x['score'] for x in res}['LABEL_1'] for res in classifier3ach(texts, top_k=2)] # achievement
resultaff = [{x['label']: x['score'] for x in res}['LABEL_1'] for res in classifier3aff(texts, top_k=2)] # affiliation
resultpow = [{x['label']: x['score'] for x in res}['LABEL_1'] for res in classifier3pow(texts, top_k=2)] # power
result = [dict(zip(['Ach', 'Aff', 'Pow'], item)) for item in list(zip(resultach, resultaff, resultpow))]
predictions = []
for res in result:
    predictions.append({k: int(round(v)) for k, v in res.items()}) # rounded
robdf = pd.DataFrame.from_records(predictions)
# print(robdf.head())

# check correlations
print(f'Winter 1994 Pearson correlations for {name1}:\n{df.corrwith(edf, drop=True)}\n')
print(f'Winter 1994 Pearson correlations for {name2}:\n{df.corrwith(setdf, drop=True)}\n')
print(f'Winter 1994 Pearson correlations for {name3}:\n{df.corrwith(robdf, drop=True)}\n')

# Winter 1994 Pearson correlations for electra:
# Pow    0.784608
# Ach    0.855252
# Aff    0.770717
#
# Winter 1994 Pearson correlations for amc-setfit:
# Pow    0.606628
# Ach    0.593990
# Aff    0.639362
#
# Winter 1994 Pearson correlations for roberta:
# Pow    0.652232
# Ach    0.597013
# Aff    0.711778


Here we can see that the ELECTRA model shows higher Pearson correlations with this dataset. Arguably, though, Pearson is not the right metric for agreement, so let's look at the relationship a bit differently.

Comparing results with visualization

The Pearson correlations are easily computed with the built-in corrwith function from pandas, as illustrated above. The library also provides crosstab for tabulating actual counts of correspondences. The code below flattens the Winter (1994) dataset and the predictions for each model, then calculates how many match. First we set up a Python dict to iterate through the dataframes easily - we will reuse this later. Then we get the raw counts of correspondences, i.e. the matches and discrepancies.

dfdict = {name1: edf, name2: setdf, name3: robdf} # set up a dict to easily compare dataframes

cols = ['Ach', 'Aff', 'Pow'] # the columns we're comparing
# flatten the actual scores across the three motive columns ('Ach', 'Aff', 'Pow')
actual_flat = df[cols].melt(var_name='column', value_name='actual')['actual']
# for each dataframe of results, get a confusion matrix
for pred in dfdict.keys():
    pred_flat = dfdict[pred][cols].melt(var_name='column', value_name='predicted')['predicted']
    # get a confusion matrix of the correspondences using `crosstab`
    confusion_matrix = pd.crosstab(actual_flat, pred_flat,
                                   rownames=['Actual'],
                                   colnames=['Predicted'])
    print(f'Confusion matrix for {pred} (totals):\n', confusion_matrix) # visualize

# Confusion matrix for electra (totals):
#  Predicted     0    1
# Actual              
# 0          3537   64
# 1           100  373
#
# Confusion matrix for amc-setfit (totals):
#  Predicted     0    1
# Actual              
# 0          3456  145
# 1           168  305
#
# Confusion matrix for roberta (totals):
#  Predicted     0    1
# Actual              
# 0          3414  187
# 1           115  358


You might notice that this doesn’t differentiate between the different labels (‘Ach’, ‘Aff’, ‘Pow’), and it might be helpful to visualize in a different way. To get the correspondences by motive code, we need a slightly more complex approach. For this we can access the scikit-learn library and for visualization we can use matplotlib and seaborn. Ensure these libraries are installed in your virtual environment with the following terminal command:

pip install scikit-learn seaborn matplotlib

The following code imports the necessary libraries and functions. It then defines a function to create a figure with subplots for each motive.

# import libraries and functions
from sklearn.metrics import multilabel_confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

def plot_confusion_matrix(confusion_matrix, axes, class_label, class_names, fontsize=12):
    """
    Plot a confusion matrix for each class label, taking the computed matrix, axes, names as arguments.
    """
    # make a dataframe for the confusion matrix results
    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names,
    )
    try:
        # use the heatmap plotting function to visualize
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d", cbar=False, ax=axes)
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")
    # populate the labels
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    axes.set_ylabel('True label')
    axes.set_xlabel('Predicted label')
    axes.set_title("Confusion Matrix for " + class_label)


The code below puts this all together. We first create a list of the actual codes (the Winter 1994 benchmark data), and then for each model’s results we construct a multilabel matrix with the actual codes and the predicted codes (using the dict that we instantiated earlier). Finally, we plot each result as a heatmap using the plot_confusion_matrix function defined above. Here we are simply showing the plot as a pop-up window, but you can also save the figure to a file by uncommenting the relevant line.

# make a list of the actual (expected) codes
y_expected = [list(map(float, z)) for z in zip(df['Pow'], df['Ach'], df['Aff'])]

# go through each model
for pred in dfdict.keys():
    # make a list of the predicted codes
    y_pred = [list(map(float, z)) for z in zip(dfdict[pred]['Pow'], dfdict[pred]['Ach'], dfdict[pred]['Aff'])]
    # make a multi label confusion matrix
    matrix = multilabel_confusion_matrix(y_expected, y_pred)
    # visualize in the terminal
    print(f'Confusion matrix {pred} model')
    confusion_matrix_A = matrix[0]
    print(confusion_matrix_A)
    confusion_matrix_B = matrix[1]
    print(confusion_matrix_B)
    confusion_matrix_C = matrix[2]
    print(confusion_matrix_C)
    # make a subplot for each motive code
    fig, ax = plt.subplots(1, 3, figsize=(9, 4))
    # plot the results for each motive
    # the matrix order follows the zip order used for y_expected (Pow, Ach, Aff),
    # not the order of `cols`, so label the panels accordingly
    for axes, cfs_matrix, label in zip(ax.flatten(), matrix, ['Pow', 'Ach', 'Aff']):
        plot_confusion_matrix(cfs_matrix, axes, label, ["0", "1"])
    # make a title
    fig.suptitle(f'Predictions on Winter dataset for {pred} model')
    fig.tight_layout() # tighten the layout
    # plt.savefig(f'Winter_confusion_matrix_{pred}.png') # save the plot
    plt.show() # show the plot

    ## uncomment the code below to print a classification report for each model
    # print(classification_report(y_expected, y_pred, output_dict=False, target_names=['Pow', 'Ach', 'Aff']))

# Confusion matrix electra model
# [[1148   21]
#  [  46  143]]
# [[1226   20]
#  [  11  101]]
# [[1163   23]
#  [  43  129]]
#
# Confusion matrix amc-setfit model
# [[1105   64]
#  [  64  125]]
# [[1223   23]
#  [  52   60]]
# [[1128   58]
#  [  52  120]]
#
# Confusion matrix roberta model
# [[1086   83]
#  [  42  147]]
# [[1217   29]
#  [  48   64]]
# [[1111   75]
#  [  25  147]]


A common set of metrics in machine learning is 'Precision' [true pos / (true pos + false pos)], 'Recall' [true pos / (true pos + false neg)], and the 'F1 score' [(2 * Precision * Recall) / (Precision + Recall)]; scores closer to 1.0 are better. These metrics can be computed using sklearn's built-in classification_report. Uncommenting the last line of the loop above produces the following reports for our datasets.

# Classification report electra model
#      precision    recall  f1-score   support
#
# Pow       0.87      0.76      0.81       189
# Ach       0.83      0.90      0.87       112
# Aff       0.85      0.75      0.80       172

# Classification report amc-setfit model
#      precision    recall  f1-score   support
#
# Pow       0.66      0.66      0.66       189
# Ach       0.72      0.54      0.62       112
# Aff       0.67      0.70      0.69       172

# Classification report roberta model
#      precision    recall  f1-score   support
#
# Pow       0.64      0.78      0.70       189
# Ach       0.69      0.57      0.62       112
# Aff       0.66      0.85      0.75       172
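As a sanity check, these metrics can be reproduced by hand from the multilabel confusion matrices printed earlier. For example, the ELECTRA model's Power matrix [[1148, 21], [46, 143]] gives:

```python
# Worked example: Precision/Recall/F1 for the ELECTRA Power label,
# using the counts from its confusion matrix: [[1148, 21], [46, 143]]
tp, fp, fn = 143, 21, 46  # true positives, false positives, false negatives

precision = tp / (tp + fp)                          # 143 / 164
recall = tp / (tp + fn)                             # 143 / 189
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
# precision=0.87, recall=0.76, f1=0.81 -- matching the report above
```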


Another common procedure for identifying correspondence between coders/datasets is the Intraclass Correlation Coefficient (ICC). The pandas library does not have a built-in function for this, but the metric can be computed using the pingouin library. Since this post is already quite long, I won't be illustrating the process here, but I may write another post in the future showing how it can be computed.

Conclusion

Here I have illustrated how you can automatically code text for implicit motives using three existing models. On my MacBook Pro, running the complete code (loading all three models and running inference on the benchmark dataset) takes about 2 minutes once the models have been downloaded. There are a few caveats regarding interpretation that we should be aware of, however.

The main issue is that the text we're using for our benchmark was generated using the PSE task. This is also the kind of data the models were trained on, which means the test data likely matches the distribution of the training data (more or less). Pang et al. (2026) also highlight this, showing that while these models perform decently on text generated via the PSE, text generated in a different context (e.g. political speeches) results in lower correlations for the various models.

Since most of the research on implicit motives uses the PSE task, these models should be quite useful in automating PSE coding. Still, the fact that not all models perform equally well on a particular PSE dataset highlights the need for independent evaluation of a given model on such datasets. Pang et al. (2026) recommend human-coding a portion of any new dataset and assessing the model's performance against a human coder before (auto-)coding the remainder. Beyond the context of the PSE, more research is needed. Human coders are quite good at adapting to implicit motive imagery in different kinds of texts, but existing automated implicit motive coding models still need some work to do so consistently.