Student projects
The following Master's and PhD students are currently working on a project (co-)supervised by Andreas Baumann - MA, PhD, Data Analysis Project (DAP) (Uni Wien), or Interdisciplinary Project in Data Science (IPDS) (TU Wien) - in the field of Digital Linguistics.
A list of announced/open project topics can be found below as well. Feel free to contact me in case you are interested.
Important information for MA projects (in German): Hinweise zu Abschlussarbeiten in der Digitalen Linguistik
Project announcements
MA-thesis, DAP
Dynamics of polysemy in German
In a recent study (Baumann et al. 2023, EMNLP), we found, based on English diachronic corpus data, that concrete words are more likely to become polysemous than abstract words; a plausible mechanism for this is that concrete words are more likely to yield metaphorical extensions than abstract words. The goal of this project is to test whether the same relationship holds in German. For this, contexts of words will be extracted from DWDS (https://www.dwds.de/d/korpora/kern) for all decades in the 19th and 20th centuries, and subsequently analyzed via BERT-based sentence embeddings to derive a measure of increase/decrease in polysemy. Finally, this measure will be correlated with lexical concreteness ratings (Charbonnier & Wartena 2020, SWISSTEXT).
Prerequisites: R or Python, text processing, ideally some familiarity with BERT, statistical analysis
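To give a rough idea of what such a polysemy measure could look like, here is a minimal sketch in Python. It assumes that polysemy is approximated by how widely the context embeddings of a word are spread in a given decade (mean pairwise cosine distance). This is only one possible operationalization and not necessarily the one used in Baumann et al. (2023). The function names and the toy 2-dimensional vectors are hypothetical; real input would be BERT-based embeddings of DWDS contexts.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dispersion(embeddings):
    """Mean pairwise cosine distance among the context embeddings of one word.

    Assumption: a wider spread of contexts is read as a sign of more senses.
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def polysemy_change(early, late):
    """Positive values suggest an increase in polysemy between two periods."""
    return dispersion(late) - dispersion(early)


# Toy stand-ins for context embeddings of one word in two decades.
early_contexts = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
late_contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

change = polysemy_change(early_contexts, late_contexts)
```

In the toy data, all early contexts point in the same direction, while one late context points elsewhere, so the change score is positive. Per-word change scores of this kind could then be correlated with concreteness ratings.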
MA-thesis, DAP, IPDS
Digital language death 2.0
There are approximately 7,000 languages spoken worldwide, many of which face varying degrees of endangerment. In his seminal paper “Digital Language Death”, Kornai (2013, PLOS One, 8(10), e77056) explored the relationship between language stability and the extent to which languages are represented in digital spaces. One of the key findings was that indicators of a language's vitality (the number of speakers, the volume of Wikipedia articles available in that language, and the level of institutional support it receives) are positively correlated. Since the publication of this study over a decade ago, the digital landscape has undergone significant transformation, raising new questions about the evolving relationship between language vitality and digital representation.
In this project, students will revisit and partially replicate Kornai’s (2013) study using updated data from Wikipedia (to be collected during the project) and Ethnologue, a comprehensive language database that provides information on language endangerment levels and speaker populations. Students will employ a combination of web-scraping, information extraction, and statistical modeling techniques to analyze the data. By doing so, they will assess whether the patterns observed by Kornai a decade ago still hold in today’s rapidly changing digital environment and explore potential shifts in the dynamics of language endangerment and digital representation. This project will not only provide insights into the current state of linguistic diversity but also equip students with practical skills in data collection, analysis, and interpretation.
Prerequisites: Python or R; ideally some experience with text processing, web-scraping (Wikipedia API), statistical analysis
This project is part of the DIGILINGDIV project. Students are expected to participate in project meetings on a regular basis.
MA-thesis, DAP, IPDS
Lexical change in a multilingual diachronic text corpus
Languages are constantly subject to change. This is most clearly visible in the words that we use. While some words enter the lexicon (i.e., the shared vocabulary of a speech community) and become more frequent, others get less frequent and might even vanish. But do these processes apply uniformly across languages and do words spread faster in languages with more strongly connected speech communities? The goal of this project is to examine rates of lexical change in different languages. For this, a size-balanced diachronic multilingual corpus is first constructed by sampling texts from Wikipedia. In a second step, rates of lexical change and fluctuation are measured through statistical analysis in all languages. Finally, measures of how strongly speakers are connected to each other (e.g., population density) will be correlated with the average rates of lexical change per language.
Prerequisites: R or Python, ideally experience with text processing and the Wikipedia API, statistical analysis
This project is part of the DIGILINGDIV project. Students are expected to participate in project meetings on a regular basis.
MA-thesis
Epidemiological compartment models in discrete time for studying language change
Compartmental ODE models established in the field of epidemiology have been used to study the diffusion of linguistic innovations, such as the spread of new words. Most approaches employ models in continuous time (Nowak 2000, J. Math. Biol., and research of my own). The goal of this project is to examine to what extent compartment models in discrete time (i.e., systems of difference equations) can be reasonably transferred from epidemiology to the study of language change. One interesting feature of some discrete-time models studied in epidemiology (in particular, SIS models) is that they can display periodic behavior, i.e., periodically fluctuating trajectories (Allen 1994, Mathematical Biosciences). Do we find such behavior in linguistic dynamics and can discrete-time models be meaningfully employed to account for it?
Prerequisites: mathematics of dynamical systems; familiarity with R, Python, Mathematica, Matlab, or similar; maybe a bit of diachronic corpus analysis and/or time series analysis
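To illustrate the kind of periodic behavior mentioned above, here is a minimal sketch of a discrete-time SIS-type model in Python. It is a generic textbook-style difference equation and not necessarily the exact formulation in Allen (1994). The variable i is read as the share of speakers using an innovation, and the parameter values are purely illustrative assumptions.

```python
def simulate_sis(beta, gamma, i0=0.1, steps=1000):
    """Iterate i[t+1] = i[t] + beta * i[t] * (1 - i[t]) - gamma * i[t].

    i     : share of 'infected' individuals (here: users of an innovation)
    beta  : adoption parameter per time step (illustrative assumption)
    gamma : abandonment parameter per time step (illustrative assumption)
    """
    trajectory = [i0]
    for _ in range(steps):
        i = trajectory[-1]
        trajectory.append(i + beta * i * (1.0 - i) - gamma * i)
    return trajectory


# Moderate parameters: the trajectory settles at the equilibrium 1 - gamma / beta.
stable = simulate_sis(beta=1.5, gamma=0.6)

# Larger beta: the equilibrium loses stability and a period-2 cycle appears.
cyclic = simulate_sis(beta=3.0, gamma=0.6)

print("stable run, last three values:", stable[-3:])
print("cyclic run, last four values:", cyclic[-4:])
```

In this particular map, the equilibrium 1 - gamma / beta is stable as long as beta - gamma lies between 0 and 2. With beta = 3.0 and gamma = 0.6 this bound is exceeded, and the share of users keeps alternating between a high and a low value instead of settling down. The corresponding continuous-time model does not show such oscillations.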
Current projects
Laura Kristen (MA-thesis project, MA Lehramt German & History)
Gender-inclusive language in the Austrian Parliament
My master's thesis investigates the use of gender-inclusive language in speeches held in the Austrian Parliament within a defined period of time. The main aim of this work is to determine the proportion of speeches potentially affected by gender-inclusive language and, furthermore, to analyze the extent to which parliamentary debates reflect gender inequalities.
Antonia Röper (MA-thesis project, MA Lehramt German)
Gaming and reading competence
This thesis explores the connection between video gaming and digital reading skills, focusing on 13-year-old students. As digital literacy becomes essential, gaming—often seen as mere entertainment—is recognized for developing key e-literacy skills such as navigation, problem-solving, and information processing. Previous research suggests that gamers, especially those who play single-player games, may outperform non-gamers in digital reading due to enhanced navigation and comprehension skills. The study aims to compare reading performance and digital navigation between gamers and non-gamers to assess this link.
Hannes Essfors (MA-thesis project, MA Digital Humanities)
Measuring Global Linguistic Diversity - Incorporating Similarity/Distance Measures Using Large-Scale Typological Databases
This thesis explores how linguistic diversity can be more accurately quantified by integrating interlinguistic distance measures. While previous studies, such as Bromham et al. (2022, Nature Ecology & Evolution), have predicted major language loss using statistical models, they often overlook digitization and similarity between languages. This thesis addresses these gaps using global linguistic databases like PHOIBLE, Grambank, ASJP, and URIEL+. By combining these with Ethnologue speaker data, the project calculates both naive and similarity-weighted diversity measures. It aims to assess data availability, biases, and correlation across phonemic, syntactic, lexical, and other diversity types.
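The difference between a naive and a similarity-weighted diversity measure can be sketched as follows. The abstract does not specify which measures the thesis uses, so the choice here (Gini-Simpson index versus Rao's quadratic entropy) is an assumption made for illustration. The speaker shares and the distance matrix are invented toy values.

```python
def gini_simpson(shares):
    """Naive diversity: probability that two random speakers use different languages."""
    return 1.0 - sum(p * p for p in shares)


def rao_quadratic_entropy(shares, distance):
    """Distance-weighted diversity: expected distance between the languages
    of two randomly drawn speakers."""
    n = len(shares)
    return sum(
        shares[i] * shares[j] * distance[i][j]
        for i in range(n)
        for j in range(n)
    )


# Toy region with three languages; shares of speakers sum to 1.
shares = [0.5, 0.3, 0.2]

# Hypothetical pairwise distances in [0, 1]: languages 0 and 1 are closely
# related, language 2 is distant from both.
distance = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
    [0.9, 0.9, 0.0],
]

naive = gini_simpson(shares)
weighted = rao_quadratic_entropy(shares, distance)
```

The toy region has a Gini-Simpson index of 0.62 but a quadratic entropy of only about 0.32, because its two largest languages are treated as nearly identical. If every pair of distinct languages is assigned distance 1, the two measures coincide.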
Katharina Zeh (MA-thesis project, MA Digital Humanities)
The Impact of Digitization on Language Diversity and Endangerment: A Statistical Cluster and Correlation Analysis Using R
Linguistic diversity is declining rapidly, with nearly half of the world’s 7,000 languages endangered and predictions suggesting up to 90% may disappear by the end of this century. While socio-political, economic, and environmental drivers of language loss are well-documented, the impact of digitization remains understudied. Digitization, a force behind globalization, exacerbates inequalities and contributes to a “digital language divide,” favoring dominant global languages like English. However, it also offers tools for language preservation through apps, AI, and digital platforms. This thesis explores digitization’s dual role by analyzing how digitization indices correlate with linguistic diversity measures, aiming to uncover nuanced patterns and implications.
Maximilian Berens (MA-thesis project, MA Digital Humanities)
Enhancing authorship attribution? Analysing the impact of emotional language in authorship attribution
Authorship attribution seeks to identify the writer of an anonymous text using various methods, now advanced by computational tools like natural language processing, machine learning, and AI. A growing research focus is emotional attribution—analyzing emotional language to uncover unique writing patterns. By integrating emotional tone as a variable, this approach could enhance traditional methods and improve accuracy in identifying authors. The core research question of this thesis is whether emotional features in texts offer reliable clues about authorship. If successful, this method could provide deeper insights into individual writing styles and expand how researchers understand and differentiate authorial voice in text analysis.
Jona Hassenbach (MA-thesis project, MA Digital Humanities)
Reception Through Time
In literary history, few figures have been received as frequently as classical characters. However, evaluating a character’s reception history often depends on the person doing the interpretation and can thus be limited by their individual understanding of language. While comparing different interpretations is one way to address this problem, I want to try a different approach: a diachronic emotion analysis using word embeddings from different time periods along with the VAD (Valence-Arousal-Dominance) emotional model. In this way, the resulting VAD scores should better reflect how a text judged a character using the language of its own time. By comparing works from different periods but centered around the same group of classical women, I hope to gain new insights into their reception history.
Sarah Bloos (MA-thesis project, MA Digital Humanities)
Do you speak “Grant”?
The Viennese Grant is perhaps the most popular sociolinguistic stereotype about the city – but little is known about how it is really perceived. Collecting data from participants from Austria, Germany and Switzerland, I’m attempting to capture Grant using a dimensional emotion model (VAD). A further interest lies in possible correlations between sociolinguistic variables such as age or gender and the perception of Grant, which may eventually reveal culturally or demographically varying clusters of understanding.
Markus Pluschkovits (PhD project, co-supervised with Alexandra Lenz)
Realizations of the Progressive Aspect in German: Form, Function and Variation
This dissertation project is concerned with the different realizations of progressive aspectuality in contemporary German. Taking a cognitive and sociolinguistic approach, the aim of the project is to use quantitative methodology to investigate the steering factors behind the choice of specific constructions to encode actions being in progress.
Claudia Mattes (PhD project, co-supervised with Alexandra Lenz)
The gehören-passive. A corpus linguistic approach to the analytic construction gehören + participle II
The non-canonical passive form, composed of gehören and the past participle of a verb, has not been extensively researched so far. By applying digital methods to different corpora, this thesis aims to better understand the construction in its different aspects, namely its development, its current grammaticalization, and its semantic-pragmatic usage.
Past projects
Hannes Essfors (Data analysis project, WS2024, MA Digital Humanities)
Sociophonetic variation in Afrikaans vowel production
This project examined a dataset of acoustic features of vowels (first and second formant) produced by white and colored speakers of Afrikaans, a Germanic language that is spoken (mainly) in South Africa. The data were recorded using two different methods (word lists vs. speech in context) by Daan Wissing (North West University, Potchefstroom, South Africa). Acoustic features had already been extracted for all configurations. The goal of the project was to compare the different configurations to assess (a) whether the examined sociolinguistic variants of Afrikaans differ from each other and (b) to what extent results based on different methodologies match.
Marina Sommer (Interdisciplinary Project in Data Science, MSc Data Science, TU Wien)
An analysis of the development of the German touch verbs ‘anfassen’, ‘angreifen’, ‘anlangen’ with text data from Common Crawl
The aim of my project was to find out if the usage of the German touch verbs "anfassen", "angreifen" and "anlangen" has changed over the last decade. The main focus was on the exploration of the unique data repository and data format of the platform Common Crawl.
Lale Tüver & Katharina Zeh (Data analysis project, SS2024, MA Digital Humanities)
Linguistic Diversity in the Digital Age: Exploring the Effect of Digital Literacy on Minority Languages
Global linguistic diversity has declined, but the mechanisms behind this trend remain unclear. While past research focused on socioeconomic factors, this study examines the role of digital literacy. Limited digital proficiency is hypothesized to marginalize minority languages online. To test this, we compiled global language data and used Shannon entropy to quantify linguistic diversity, analyzing links with digital and demographic indicators. Our findings show that internet access supports linguistic diversity, while education levels negatively impact it. However, digital skills had no significant effect. The study contributes to discussions on language endangerment and suggests directions for future research.
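As a brief illustration of the diversity measure named in this abstract, Shannon entropy can be computed from speaker counts as sketched below. The function name and the toy counts are hypothetical, and the natural logarithm is an assumption, since the abstract does not state the base used in the project.

```python
import math


def shannon_entropy(speaker_counts):
    """Shannon entropy (natural log) of the distribution of speakers over languages."""
    total = sum(speaker_counts)
    shares = [count / total for count in speaker_counts if count > 0]
    return -sum(p * math.log(p) for p in shares)


# Toy countries: one with a single dominant language, one with an even split.
dominated = shannon_entropy([9_000, 500, 500])
balanced = shannon_entropy([5_000, 5_000])
monolingual = shannon_entropy([10_000])
```

Entropy is zero when all speakers share one language and grows as speakers are spread more evenly over more languages, so it captures both the number of languages and their relative sizes.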
Martin Miesbauer (Interdisciplinary Project in Data Science, MSc Data Science, TU Wien)
The role of linguistically encoded emotional characteristics for cooperativeness in the Zurich Tangram Corpus
Research suggests that emotions correlate positively with cooperation in collaborative tasks. This study explores whether emotions can predict cooperativeness using a dataset of cooperative interactions. Emotional states are defined by three dimensions: valence (negative-positive), arousal (calm-agitated), and dominance (submissive-dominant). The study examines the importance of these factors in predicting cooperativeness and analyzes the impact of different measures. Specifically, it focuses on predicting task completion time, which is inversely related to cooperativeness. The findings aim to enhance understanding of how emotional states influence teamwork and task efficiency.