Student projects

The following Masters/PhD students are currently working on a project (co-)supervised by Andreas Baumann - MA, PhD, Data Analysis Project (DAP) (Uni Wien), or Interdisciplinary Project Data Science (IPDS) (TU Wien) - in the field of Digital Linguistics.

A list of announced/open project topics can be found below as well. Feel free to contact me in case you are interested.

Important information for MA projects (in German): Hinweise zu Abschlussarbeiten in der Digitalen Linguistik

Project announcements

MA-thesis (Germanistik)

Austrian dialect competences of LLMs

Large language models (LLMs) seem to have some potential when it comes to processing dialect data (Ingram, 2025, AI for community). However, a study on English dialects has shown that LLMs are prone to amplifying stereotypes encoded in the model's underlying training data (Fleisig et al., 2025, arXiv:2406.08818). In this project, the competences of conversational agents based on state-of-the-art LLMs (e.g., LeChat, ChatGPT, etc.) will be analyzed through a study design that combines conversational interactions with qualitative interviews conducted with a set of speakers of dialects spoken in Austria.

Prerequisites: German; knowledge of Austrian dialectology; know-how in qualitative research based on interviews is absolutely mandatory

This project is part of the DIGILINGDIV project. Students are expected to participate in project meetings on a regular basis.

MA thesis

Creating a gold standard for historical emotion analysis

Computational methods for inferring the emotional semantics of a word backwards in time require gold standard data for the purpose of model evaluation. In their pioneering work, Hellrich et al. (2018, arxiv.org/abs/1806.08115) have created such datasets for English and German. By asking linguistically trained informants about their subjective assessment of the emotional semantics of words (valence, arousal, dominance) in the 1830s, they successfully collected gold standard emotion lexica. However, the number of annotators was limited in this study. The goal of this thesis project is to create a temporally layered historical emotion lexicon that draws on a larger number of expert ratings. The project can be conducted in English or German.

Prerequisites: ideally some experience with survey tools (e.g., SoSciSurvey), basic data analysis skills for computing inter-annotator agreement

MA-thesis, DAP, IPDS

Lexical change in a multilingual diachronic text corpus

Languages are constantly subject to change. This is most clearly visible in the words that we use. While some words enter the lexicon (i.e., the shared vocabulary of a speech community) and become more frequent, others get less frequent and might even vanish. But do these processes apply uniformly across languages and do words spread faster in languages with more strongly connected speech communities? The goal of this project is to examine rates of lexical change in different languages. For this, a size-balanced diachronic multilingual corpus is first constructed by sampling texts from Wikipedia. In a second step, rates of lexical change and fluctuation are measured through statistical analysis in all languages. Finally, measures of how strongly speakers are connected to each other (e.g., population density) will be correlated with the average rates of lexical change per language.

Prerequisites: R or Python, ideally experience with text processing and the Wikipedia API, statistical analysis

This project is part of the DIGILINGDIV project. Students are expected to participate in project meetings on a regular basis.

MA thesis, IPDS

Creating NLP resources for the study of semantic change in Afrikaans

Afrikaans is a Germanic language spoken in Sub-Saharan Africa, in particular in South Africa and Namibia. While synchronic corpus resources and pre-trained semantic models exist for Afrikaans (e.g., provided by SADiLaR), there are currently no off-the-shelf resources for conducting diachronic analyses of Afrikaans semantics. The goal of this project is to fill this gap by deriving word embeddings (static and, if time and infrastructure allow for it, embeddings on the token level) from diachronic Afrikaans corpus data (Kirsten 2019). Preprocessing will involve the AfriBooms UDPipe model or a similar tool.

Prerequisites: solid knowledge of NLP and distributional semantics, Python/R

Current projects

Lale Tüver (MA-thesis project, MA Digital Humanities)

Mapping Meaning in Indo-European Languages: An Embedding Based Semantic Network Analysis Using OpenSubtitles

Language organizes meaning through networks of relations among words. Traditional lexical semantics studies synonymy, antonymy, hierarchies, and fields, while historical linguistics compares Indo-European languages using shared innovations. Recent computational methods model meaning via distributional embeddings and analyze system-level structure with semantic networks. Yet links between semantic structure and genealogical relatedness remain underexplored: prior work often relies on lexical overlap or parallel translations, capturing alignment rather than internal organization. This thesis bridges these strands by deriving semantic networks from word embeddings for Indo-European languages and examining graph-theoretic properties. It asks whether related languages share network characteristics and whether such properties recover recognized subgroupings, using hierarchical clustering over structural profiles for comparison.

Amandine Grieshaber (MA-thesis project, MA Digital Humanities)

Multilingualism in Digital and Analog Spaces: Development and Application of a Standardized Questionnaire to Assess Language Choice and Language Motivation in Multilingual Contexts

The ongoing digital transformation is reshaping people's behavior and communication, creating tension between digital communication and linguistic diversity. While a global decline in linguistic diversity is observed, digital platforms like social media fundamentally alter linguistic interaction conditions, differing from analog situations. This is especially relevant in multilingual contexts where speakers regularly switch between different languages. The key question is how digital spaces influence language choice and use, particularly for minority language speakers whose behavior varies across social, cultural, and functional contexts. Digital media offers dual opportunities: it can increase minority language visibility but often favors dominant languages. This research project will examine language choice in communication contexts and develop a standardized questionnaire to assess language choice and motivation in multilingual settings.

This MA-thesis is part of the DIGILINGDIV project.

Nadejda Rubinskii (MA-thesis project, MA Digital Humanities)

Making Sense of Meaning: Dynamics of Polysemy in German

Polysemous words carry multiple senses, such as German Maus (rodent or computer device). Most words are polysemous, with senses evolving as language changes. Lexicographers have not reached a consensus on how to count senses. Polysemy affects language broadly: concrete words more often become polysemous, and overlap between senses in the mental lexicon facilitates both language acquisition in children and foreign language learning. This thesis aims to identify senses contextually for German and verify these relationships. It extracts target word contexts from the Digitales Wörterbuch der deutschen Sprache (DWDS) across all decades of the twentieth century, selecting words based on existing research on concreteness and age of acquisition. Using BERT-based sentence embeddings and unsupervised k-means clustering, the study derives sense counts and trends in the development of polysemy, and finally correlates these with age of acquisition and concreteness to analyze the behavior of polysemous words in German.

Pia Pilsbacher (MA-thesis project, Deutsche Philologie)

Epidemiological compartment models in discrete time for studying language change

Compartmental ODE models established in the field of epidemiology have been used to study the diffusion of linguistic innovations, such as the spread of new words. Most approaches employ models in continuous time (Nowak 2000). The goal of this project is to examine to what extent compartment models in discrete time (i.e., systems of difference equations) can be reasonably transferred from epidemiology to the study of language change. One interesting feature of some discrete-time models studied in epidemiology (in particular, SIS models) is that they can display periodic behavior, i.e., periodically fluctuating trajectories (Allen 1994, Mathematical Biosciences). Do we find such behavior in linguistic dynamics and can discrete-time models be meaningfully employed to account for it?
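
The kind of periodic behavior mentioned above can be illustrated with a minimal sketch. It assumes the normalized form of a discrete-time SIS model with constant population size, i[n+1] = i[n] * (1 - gamma + alpha * (1 - i[n])), where i is read here as the fraction of speakers using an innovation. The function name, the linguistic reading of the parameters, and the parameter values are illustrative assumptions, not part of the project.

```python
def simulate_sis(alpha, gamma, i0=0.1, steps=500):
    """Iterate a discrete-time SIS model for the adopter fraction i.

    alpha: adoption ("infection") rate per time step (hypothetical name)
    gamma: abandonment ("recovery") rate per time step (hypothetical name)
    """
    trajectory = [i0]
    for _ in range(steps):
        i = trajectory[-1]
        # Current adopters stay with probability (1 - gamma);
        # new adopters are recruited from the non-adopters (1 - i).
        trajectory.append(i * (1.0 - gamma + alpha * (1.0 - i)))
    return trajectory


# Small rates: the adopter fraction settles at the equilibrium 1 - gamma/alpha.
stable = simulate_sis(alpha=0.5, gamma=0.2)

# Larger rates: the trajectory alternates between two values (a 2-cycle).
cycle = simulate_sis(alpha=2.5, gamma=0.4)

print("stable:", stable[-2:])
print("cycle: ", cycle[-2:])
```

With the first parameter set the trajectory converges to a single equilibrium; with the second it keeps alternating between a high and a low adopter fraction, which is the type of fluctuation the project asks about.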

Stefan Ceska (MA-thesis project, MA Digital Humanities)

Digital language death 2.0

There are approximately 7,000 languages spoken worldwide, many of which face varying degrees of endangerment. In his seminal paper “Digital Language Death”, Kornai (2013, PLOS One, 8(10), e77056) explored the relationship between language stability and the extent to which languages are represented in digital spaces. One of the key findings was that different indicators of a language's vitality (the number of speakers, the volume of Wikipedia articles available in that language, and the level of institutional support it receives) are positively correlated. Since the publication of this study over a decade ago, the digital landscape has undergone significant transformation, raising new questions about the evolving relationship between language vitality and digital representation.

In this project, Kornai’s (2013) study will be replicated using updated data from Wikipedia (to be collected during the project) and Ethnologue, a comprehensive language database that provides information on language endangerment levels and speaker populations. 

Antonia Röper (MA-thesis project, MA Lehramt German)

Gaming and reading competence

This thesis explores the connection between video gaming and digital reading skills, focusing on 13-year-old students. As digital literacy becomes essential, gaming—often seen as mere entertainment—is recognized for developing key e-literacy skills such as navigation, problem-solving, and information processing. Previous research suggests that gamers, especially those who play single-player games, may outperform non-gamers in digital reading due to enhanced navigation and comprehension skills. The study aims to compare reading performance and digital navigation between gamers and non-gamers to assess this link.

Sarah Bloos (MA-thesis project, MA Digital Humanities)

Do you speak “Grant”?

The Viennese Grant is perhaps the most popular sociolinguistic stereotype about the city, but little is known about how it is really perceived. Collecting data from participants from Austria, Germany and Switzerland, I’m attempting to capture Grant using a dimensional emotion model (VAD). I am further interested in possible correlations between sociolinguistic variables like age or gender and the perception of Grant, with the eventual aim of identifying culturally or demographically varying clusters of understanding.

Paul Schmitt (Interdisciplinary Project in Data Science, MSc Data Science, TU Wien)

Dynamics of polysemy in German

The goal of this project is to develop a word-sense disambiguation model in order to analyze dynamics in polysemy in German. For this, contexts of words will be extracted from DWDS (https://www.dwds.de/d/korpora/kern) for all decades in the 19th and 20th centuries, and subsequently analyzed via BERT-based sentence embeddings to derive a measure of increase/decrease in polysemy.

Juliane Benson (PhD project, co-supervised with Julia Neidhardt)

The evolution of the linguistic diversity in Canada

This dissertation examines linguistic diversity with three interconnected focuses. First, it extends the diachronic study of Canada by comparing provincial language diversity over recent decades using five‑year census data and diversity metrics. Second, the project investigates how colonial history shaped provincial and global language diversity by building a database of colonial start/end dates, colonizers, and colonization types, expanding COLDAT and using historical maps to enable correlation analyses. Third, multilingualism in the digital age is explored, developing surveys to capture non‑first‑language use and online presence, drawing on LEAP‑Q, LHQ3 and CILD‑Q to design a new questionnaire. In doing so, the dissertation addresses Indigenous language loss and aims to inform reconciliation and language policy.

Markus Pluschkovits (PhD project, co-supervised with Alexandra Lenz)

Realizations of the Progressive Aspect in German: Form, Function and Variation

This dissertation project is concerned with the different realizations of progressive aspectuality in contemporary German. Taking a cognitive and sociolinguistic approach, the aim of the project is to use quantitative methodology to investigate the steering factors behind the choice of specific constructions to encode actions being in progress.

Claudia Mattes (PhD project, co-supervised with Alexandra Lenz)

The gehören-passive. A corpus linguistic approach to the analytic construction gehören + participle II

The non-canonical passive form, composed of gehören and the past participle of a verb, has not been extensively researched so far. By applying digital methods to different corpora, this thesis aims to better understand the construction in its different aspects, namely its development, its current grammaticalization and its semantic-pragmatic usage.

Past projects

Nadia Rubinskii & Amandine Grieshaber (Data analysis project, WS2025, MA Digital Humanities)

How many meanings does a word really have?

Polysemy, i.e., the property of a word of having more than one sense, is the rule rather than the exception. The degree of polysemy can be inferred in many ways: by counting the number of entries in a lexicon, by analyzing the set of a word's semantic neighbors, or by clustering contexts that a word surfaces in based on some semantic representations (e.g., BERT sentence embeddings). However, very often two senses are very similar to each other - raising the question of whether they are separate senses in the first place. What is more, some of a word's senses are typically more common than others. Hence, the question is: what is the actual number of senses that an individual perceives?

The goal of this project is to crowd-source subjective estimates of the number of senses for a list of several thousand English and/or German words. Based on the collected data, the perceived degree of polysemy will be computed. 

Laura Kristen (MA-thesis project, MA Lehramt German & History)

Gender-inclusive language in the Austrian Parliament

My master's thesis investigates the use of gender-inclusive language in the speeches of the Austrian Parliament within a defined period of time. The main aim of this work is to determine the proportion of speeches potentially affected by gender-inclusive language and, furthermore, to analyze the extent to which parliamentary debates reflect gender inequalities.

Jona Hassenbach (MA-thesis project, MA Digital Humanities)

Reception Through Time

In literary history, there are few figures who have been received as frequently as classical characters. However, evaluating a character’s reception history often depends on the person doing the interpretation and can thus be limited by their individual understanding of language. While comparing different interpretations is one way to address this problem, I want to try a different approach: a diachronic emotion analysis using word embeddings from different time periods along with the VAD (Valence-Arousal-Dominance) emotional model. In this way, the resulting VAD scores should better reflect how a text judged a character using the language of its own time. By comparing works from different periods but centered around the same group of classical women, I hope to gain new insights into their reception history.

Max Tiessler (Interdisciplinary Project in Data Science, MSc Data Science, TU Wien)

Tracking Word Meanings and Emotions Over Time

Recent computational linguistics research tracks semantic and emotional changes in words using techniques like word embeddings and sense modeling. However, these aspects are often studied separately, limiting insight into how word meanings evolve emotionally over time. To address this, we integrate diachronic sense-tracking with Valence-Arousal-Dominance (VAD) emotion modeling. Emotion models infer VAD scores from dictionary sense definitions, which are then weighted by decade-specific sense distributions to generate emotional trajectories for each word. This unified approach offers a richer, time-sensitive view of language change, capturing both semantic and emotional evolution more holistically.
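
The weighting step described above can be sketched as a weighted average: each sense has one VAD triple, and for every decade the triples are combined using that decade's sense shares. This is a minimal sketch under stated assumptions. The function name, the data layout, and all numbers are hypothetical and are not taken from the project or its publication.

```python
def emotional_trajectory(sense_vad, sense_shares_by_decade):
    """Combine per-sense VAD scores into one VAD triple per decade.

    sense_vad: {sense: (valence, arousal, dominance)}
    sense_shares_by_decade: {decade: {sense: share of usages in that decade}}
    """
    trajectory = {}
    for decade, shares in sense_shares_by_decade.items():
        total = sum(shares.values())  # normalize in case shares do not sum to 1
        trajectory[decade] = tuple(
            sum(share * sense_vad[sense][dim] for sense, share in shares.items()) / total
            for dim in range(3)
        )
    return trajectory


# Toy example with invented values for two senses of one word.
sense_vad = {
    "rodent": (0.40, 0.50, 0.30),
    "device": (0.60, 0.30, 0.60),
}
shares = {
    1950: {"rodent": 1.0, "device": 0.0},
    2000: {"rodent": 0.5, "device": 0.5},
}

trajectory = emotional_trajectory(sense_vad, shares)
for decade, vad in sorted(trajectory.items()):
    print(decade, [round(v, 3) for v in vad])
```

As the share of a sense grows from one decade to the next, the word's overall VAD triple moves toward that sense's scores, which yields the emotional trajectory.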

The research conducted in this project was published in the proceedings of Computational Humanities Research 2025: https://doi.org/10.63744/tdBQckiQA3FI

Maximilian Berens (MA-thesis project, MA Digital Humanities)

Enhancing authorship attribution? Analysing the impact of emotional language in authorship attribution

Authorship attribution seeks to identify the writer of an anonymous text using various methods, now advanced by computational tools like natural language processing, machine learning, and AI. A growing research focus is emotional attribution—analyzing emotional language to uncover unique writing patterns. By integrating emotional tone as a variable, this approach could enhance traditional methods and improve accuracy in identifying authors. The core research question of this thesis is whether emotional features in texts offer reliable clues about authorship. If successful, this method could provide deeper insights into individual writing styles and expand how researchers understand and differentiate authorial voice in text analysis.

Hannes Essfors (MA-thesis project, MA Digital Humanities)

Measuring Global Linguistic Diversity - Incorporating Similarity/Distance Measures using large scale Typological Databases

This thesis explores how linguistic diversity can be more accurately quantified by integrating interlinguistic distance measures. While previous studies, such as Bromham et al. (2022, Nature Ecology & Evolution), have predicted major language loss using statistical models, they often overlook digitization and similarity between languages. This thesis addresses these gaps using global linguistic databases like PHOIBLE, Grambank, ASJP, and URIEL+. By combining these with Ethnologue speaker data, the project calculates both naive and similarity-weighted diversity measures. It aims to assess data availability, biases, and correlation across phonemic, syntactic, lexical, and other diversity types.
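
The contrast between naive and similarity-weighted diversity can be illustrated with a short sketch. It uses the Gini-Simpson index as a naive measure and Rao's quadratic entropy as a distance-weighted one. Both are common choices, but the thesis may use different measures, and the speaker shares and distances below are invented for illustration.

```python
def gini_simpson(shares):
    """Naive diversity: probability that two random speakers differ in language."""
    return 1.0 - sum(p * p for p in shares)


def rao_quadratic_entropy(shares, distances):
    """Distance-weighted diversity: expected distance between the
    languages of two randomly drawn speakers."""
    n = len(shares)
    return sum(
        shares[i] * shares[j] * distances[i][j]
        for i in range(n)
        for j in range(n)
    )


# Hypothetical speaker shares of three languages in one region.
shares = [0.5, 0.3, 0.2]

# Hypothetical pairwise distances in [0, 1]: languages 0 and 1 are closely related.
distances = [
    [0.0, 0.2, 1.0],
    [0.2, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

naive = gini_simpson(shares)
weighted = rao_quadratic_entropy(shares, distances)
print(round(naive, 3), round(weighted, 3))
```

When all languages are maximally distant from each other, the two measures coincide; the more similar the languages are, the further the weighted value falls below the naive one.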

Katharina Zeh (MA-thesis project, MA Digital Humanities)

The Impact of Digitization on Language Diversity and Endangerment: A Statistical Cluster and Correlation Analysis Using R

Linguistic diversity is declining rapidly, with nearly half of the world’s 7,000 languages endangered and predictions suggesting up to 90% may disappear by the end of this century. While socio-political, economic, and environmental drivers of language loss are well-documented, the impact of digitization remains understudied. Digitization, a force behind globalization, exacerbates inequalities and contributes to a “digital language divide,” favoring dominant global languages like English. However, it also offers tools for language preservation through apps, AI, and digital platforms. This thesis explores digitization’s dual role by analyzing how digitization indices correlate with linguistic diversity measures, aiming to uncover nuanced patterns and implications.

Hannes Essfors (Data analysis project, WS2024, MA Digital Humanities)

Sociophonetic variation in Afrikaans vowel production

This project analyzed a dataset of acoustic vowel features (first and second formant) produced by white and colored speakers of Afrikaans, a Germanic language that is spoken (mainly) in South Africa. The data were recorded using two different methods (word lists vs. speech in context) by Daan Wissing (North West University, Potchefstroom, South Africa). Acoustic features had already been extracted for all configurations. The goal of the project was to compare the different configurations to assess (a) whether the examined sociolinguistic variants of Afrikaans differ from each other and (b) to what extent results based on different methodologies match.

Marina Sommer (Interdisciplinary Project in Data Science, MSc Data Science, TU Wien)

An analysis of the development of the German touch verbs ‘anfassen’, ‘angreifen’, ‘anlangen’ with text data from Common Crawl

The aim of my project was to find out if the usage of the German touch verbs "anfassen", "angreifen" and "anlangen" has changed over the last decade. The main focus was on the exploration of the unique data repository and data format of the platform Common Crawl.

Lale Tüver & Katharina Zeh (Data analysis project, SS2024, MA Digital Humanities)

Linguistic Diversity in the Digital Age: Exploring the Effect of Digital Literacy on Minority Languages

Global linguistic diversity has declined, but the mechanisms behind this trend remain unclear. While past research focused on socioeconomic factors, this study examines the role of digital literacy. Limited digital proficiency is hypothesized to marginalize minority languages online. To test this, we compiled global language data and used Shannon entropy to quantify linguistic diversity, analyzing links with digital and demographic indicators. Our findings show that internet access supports linguistic diversity, while education levels negatively impact it. However, digital skills had no significant effect. The study contributes to discussions on language endangerment and suggests directions for future research.
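
As a brief illustration of the diversity measure named above, Shannon entropy can be computed from the speaker counts of the languages in a region. The sketch below is minimal; the function name and the counts are hypothetical, and the natural logarithm is an arbitrary choice of base.

```python
import math


def shannon_entropy(speaker_counts):
    """Shannon entropy H = -sum(p * ln p) over language shares p."""
    total = sum(speaker_counts)
    return -sum(
        (count / total) * math.log(count / total)
        for count in speaker_counts
        if count > 0  # languages without speakers contribute nothing
    )


# Hypothetical regions: one with four equally large languages,
# one dominated by a single language.
even_region = shannon_entropy([250, 250, 250, 250])
skewed_region = shannon_entropy([970, 10, 10, 10])
print(round(even_region, 3), round(skewed_region, 3))
```

Entropy is highest when speakers are spread evenly across languages and approaches zero as a single language comes to dominate.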

Martin Miesbauer (Interdisciplinary Project in Data Science, MSc Data Science, TU Wien)

The role of linguistically encoded emotional characteristics for cooperativeness in the Zurich Tangram Corpus

Research suggests that emotions correlate positively with cooperation in collaborative tasks. This study explores whether emotions can predict cooperativeness using a dataset of cooperative interactions. Emotional states are defined by three dimensions: valence (negative-positive), arousal (calm-agitated), and dominance (submissive-dominant). The study examines the importance of these factors in predicting cooperativeness and analyzes the impact of different measures. Specifically, it focuses on predicting task completion time, which is inversely related to cooperativeness. The findings aim to enhance understanding of how emotional states influence teamwork and task efficiency.