Conference Review: Project A02 Presented the First German BabyLM Corpus at CoNLL 2025 in Vienna
A02 researchers Bastian Bunzeck, Daniel Duran and Sina Zarrieß attended the 29th Conference on Computational Natural Language Learning (CoNLL 2025) in Vienna from July 31 to August 1 and presented the world’s first German BabyLM corpus. CoNLL is one of the premier venues for work on theoretically grounded and cognitively motivated approaches to computational language learning (with an acceptance rate below 20%!).
As such, it fits directly with one of the main aims of A02: creating computational models of creative pronunciation variation. To do so, however, adequate data is needed first. In a pioneering effort, Bastian, Daniel and Sina created a dataset of developmentally plausible training data in German. With this data, they investigated which linguistic structures a BabyLM can learn from a relatively small amount of input: Words? Syntax? Semantics? As it turns out, simpler child-directed language helps models on lexical benchmarks, but less so on syntactic ones.
During the poster presentation, Bastian, Daniel and Sina received inspiring feedback, especially on ways in which such models could be made even more developmentally plausible, for example by incorporating raw speech signals instead of text. In sum, the general theme of CoNLL aligned perfectly with the work in A02, and new ideas about phonetic and phonological properties in language models led to interesting discussions with colleagues from Cambridge and Amsterdam.
Bastian Bunzeck in Front of the Poster © Daniel Duran
Project A02 © Sascha Hermannski
Back row from left to right: Bastian Bunzeck, Joana Cholin, Leonie Schade & Daniel Duran
Front row from left to right: Petra Wagner & Sina Zarrieß