Data-projects-with-R-and-GitHub

Ziyi’s Project Description: Surprisal-Based Comparison of Human and LLM Language Generation

1. Background

Recent research in psycholinguistics suggests that language comprehension is strongly influenced by prediction. When reading or listening to a sentence, humans continuously generate expectations about upcoming words.

A key concept used to model this process is surprisal. Surprisal measures how unexpected a word is given its context. It is defined as:

Surprisal(word) = -log P(word | context)

Higher probability → lower surprisal
Lower probability → higher surprisal

[Figure: Surprisal concept]

This means that words that are highly predictable have low surprisal, while unexpected words have high surprisal.
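As a small worked example of the definition above, the sketch below computes surprisal for two probabilities. The base of the logarithm is not specified in this description; base 2 (surprisal in bits) is assumed here, and the function name `surprisal` is purely illustrative.

```python
import math


def surprisal(probability: float) -> float:
    """Return surprisal in bits (assumption: base-2 logarithm)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# A predictable word (P = 0.5) versus an unexpected one (P = 0.01).
print(surprisal(0.5))             # 1.0
print(round(surprisal(0.01), 2))  # 6.64
```

With a different logarithm base the values are rescaled by a constant factor, so the ordering of words by surprisal stays the same.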

The surprisal theory of sentence processing was formalized by Levy (2008), who proposed that the difficulty of processing a word in a sentence is proportional to its surprisal value. Later work by Smith & Levy (2013) further demonstrated that reading time increases linearly with surprisal.

[Figure: Predictive processing]

In recent years, Large Language Models (LLMs) such as GPT models have become powerful tools for generating language. However, it is still an open research question whether the statistical properties of language generated by LLMs resemble those of human language production.

[Figure: LLMs' training]

This project explores this question by comparing surprisal values of human-generated and LLM-generated sentence continuations under controlled contexts.


2. Research Question

The main research question of this project is:

Do human speakers and Large Language Models generate language with similar surprisal patterns under the same contextual conditions?

More specifically, the project investigates:


3. Dataset (by folder)

3.1 LLMs Processed Data

This folder contains sentences generated by two LLM generators under different path conditions.

3.2 Processed final data (main focus)

The dataset used in this project was collected in the course “Verarbeitung kreativer Sprache mit ChatGPT” in the previous semester.

The data contains sentence continuations generated by:

Example structure of the dataset:

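| item | context | condition | generator | surprisal |
|------|---------|-----------|-----------|-----------|
| 1    | high    | RA        | GPT-3.5   | 6.32      |
| 2    | high    | EW        | human     | 5.12      |
| 3    | medium  | OH        | GPT-5.2   | 7.41      |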

The dataset is structured at the token level, meaning that each row corresponds to a single token within a sentence rather than a full item.

Each item (i.e., sentence continuation) consists of multiple tokens, and surprisal values are computed per token.

The dataset includes the following variables (the final dataset is in ZiyiWang/data/Processed_final_data):

- item: sentence ID
- token: token index within the sentence
- surprisal: surprisal value per token
- generator: model or human source
- context: contextual constraint (high vs. medium)
- condition: particle condition (EW, RA, none)

In this project, context and condition are treated as independent variables:

- context: high vs. medium (2 levels)
- condition: EW, RA, none (3 levels)

Together, they form a 2 × 3 experimental design.
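The design cells can be enumerated with a short sketch. The level names are taken from the variable descriptions above; the constant names `CONTEXTS` and `CONDITIONS` are illustrative and not part of the project code.

```python
from itertools import product

# Factor levels as listed in the variable descriptions above.
CONTEXTS = ["high", "medium"]
CONDITIONS = ["EW", "RA", "none"]

# Every combination of context and condition is one design cell.
cells = list(product(CONTEXTS, CONDITIONS))

for context, condition in cells:
    print(context, condition)

print(len(cells))  # 6
```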

3.3 original pure data

This folder contains the original data table with the condition, the context setting, and the opening sentence for each item.

3.4 surprisal calculation

This folder contains the surprisal values for the tokens in each sentence of the continuations generated by each generator using llm_generate.py.


4. Data Processing Plan

Before the analysis, the dataset will be cleaned and standardized.

The main preprocessing steps include:

1. Handling missing data

Human participants produced responses for only 34 items. To keep the comparison between humans and LLMs balanced, items 13 and 28 will be removed from that analysis. This is not a problem for the comparison between the LLM generators.
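A minimal sketch of this exclusion step, assuming the data has been loaded as a list of dictionaries with an `item` key. The function name `drop_excluded` and the sample rows are hypothetical and are not real project data.

```python
# Items to exclude from the human-versus-LLM comparison, as stated above.
EXCLUDED_ITEMS = {13, 28}


def drop_excluded(rows, excluded=EXCLUDED_ITEMS):
    """Keep only rows whose item ID is not in the excluded set."""
    return [row for row in rows if row["item"] not in excluded]


# Made-up rows for illustration only.
rows = [
    {"item": 12, "generator": "human", "surprisal": 5.0},
    {"item": 13, "generator": "GPT-3.5", "surprisal": 6.0},
    {"item": 28, "generator": "GPT-5.2", "surprisal": 7.0},
]

kept = drop_excluded(rows)
print([row["item"] for row in kept])  # [12]
```

For an analysis that compares only the LLM generators, the filter can simply be skipped so that all items are retained.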

2. Aggregating surprisal values
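One possible aggregation is sketched below, assuming token-level surprisal is averaged within each generator and item. The mean is an assumption based on the "mean surprisal value" named in the visualization plan, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_surprisal_per_item(rows):
    """Average token-level surprisal within each (generator, item) pair."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["generator"], row["item"])].append(row["surprisal"])
    return {key: mean(values) for key, values in grouped.items()}


# Made-up token rows for illustration only.
rows = [
    {"generator": "human", "item": 1, "token": 1, "surprisal": 4.0},
    {"generator": "human", "item": 1, "token": 2, "surprisal": 6.0},
    {"generator": "GPT-3.5", "item": 1, "token": 1, "surprisal": 3.0},
]

item_means = mean_surprisal_per_item(rows)
print(item_means[("human", 1)])    # 5.0
print(item_means[("GPT-3.5", 1)])  # 3.0
```

Averaging per item first gives each continuation equal weight regardless of its length in tokens.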

5. Visualization Plan

To compare surprisal patterns across generators, the results will be visualized using line plots.

The visualization will show:

x-axis: context condition

y-axis: mean surprisal value

lines: different generators (human vs LLMs)

This plot will allow us to observe whether human-generated language and LLM-generated language show similar or different surprisal patterns across contexts.
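The values behind such a line plot can be prepared as in the following sketch, which computes one mean per generator and context level. It assumes rows with `generator`, `context`, and `surprisal` keys; the function name `line_series` and the sample rows are hypothetical, and the plotting call itself is left out.

```python
from collections import defaultdict
from statistics import mean

CONTEXT_ORDER = ["high", "medium"]  # x-axis positions


def line_series(rows):
    """Return {generator: [mean surprisal per context, in CONTEXT_ORDER]}."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row["generator"], row["context"])].append(row["surprisal"])
    generators = sorted({generator for generator, _ in cells})
    return {
        g: [mean(cells[(g, c)]) if (g, c) in cells else None
            for c in CONTEXT_ORDER]
        for g in generators
    }


# Made-up rows for illustration only.
rows = [
    {"generator": "human", "context": "high", "surprisal": 4.0},
    {"generator": "human", "context": "high", "surprisal": 6.0},
    {"generator": "human", "context": "medium", "surprisal": 7.0},
    {"generator": "GPT-3.5", "context": "high", "surprisal": 3.0},
]

series = line_series(rows)
print(series)
```

Each list in the result corresponds to one line in the plot; a `None` entry marks a context level with no data for that generator.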

[Figure: 3.5 vs. 5.2 result by context]

[Figure: 3.5 vs. 5.2 result by particle]

[Figure: 3.5 vs. 5.2 vs. human result by context]

6. Expected Contribution

References

Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126–1177.

Smith, N. J., & Levy, R. (2013). The effect of word predictability on reading time is logarithmic. Cognition, 128(3), 302–319.

Lu, Jeuring, Gatt (2025). Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues.