Conference Review: Project A02 Presented the First German BabyLM Corpus at CoNLL 2025 in Vienna
A02 researchers Bastian Bunzeck, Daniel Duran and Sina Zarrieß attended the 29th Conference on Computational Natural Language Learning (CoNLL 2025) in Vienna from July 31 to August 1 and presented the world’s first German BabyLM corpus. CoNLL is one of the premier venues for work on theoretically grounded and cognitively motivated approaches to computational language learning (with an acceptance rate below 20%!).
As such, it fits directly with one of the main aims of A02: creating computational models of creative pronunciation variation. To do so, however, adequate data is needed first. In a pioneering effort, Bastian, Daniel and Sina created a dataset of developmentally plausible training data in German. With this data, they investigated which linguistic structures a BabyLM can learn from a relatively small amount of input: Words? Syntax? Semantics? As it turns out, simpler child-directed language helps models on lexical benchmarks, but less so on syntactic ones.
During the poster presentation, Bastian, Daniel and Sina received inspiring feedback, especially on ways in which such models could be made even more developmentally plausible, for example by incorporating raw speech signals instead of text. In sum, the general theme of CoNLL aligned perfectly with the work in A02, and new ideas about phonetic and phonological properties in language models led to interesting discussions with colleagues from Cambridge and Amsterdam.
Bastian Bunzeck in Front of the Poster © Daniel Duran
Project A02 © Sascha Hermannski
Back row from left to right: Bastian Bunzeck, Joana Cholin, Leonie Schade & Daniel Duran
Front row from left to right: Petra Wagner & Sina Zarrieß