Center for Uncertainty Studies Blog - Category: Digital Academy
Digital Academy 2023: Exploring Uncertainty in Toponyms within the British Colonial Corpus
From September 25 to 28, 2023, the Digital History Working Group at Bielefeld University welcomed participants to the Digital Academy, themed "From Uncertainty to Action: Advancing Research with Digital Data." This event delved into the complexities of data-based research, exploring strategies to navigate uncertainties within the Digital Humanities. In a series of blog posts, four attendees of the workshop program share insights into their work on data collections and analysis and reflect on the knowledge gained from the interdisciplinary discussions at the Digital Academy. Learn more about the event by visiting the Digital Academy website.
Exploring Uncertainty in Toponyms within the British Colonial Corpus
by Shanmugapriya T
My research project aims to extract toponyms from the British India colonial corpus to create a historical gazetteer. The primary challenge in this work revolves around the toponyms themselves, as they exhibit a high degree of fuzziness and inconsistency, particularly in their spellings. Historically, mapping, documenting, and surveying have been recognized as essential tools employed by colonial powers to demarcate, expand, and exert control over their colonial subjects. These activities enabled the colonial administration to establish governance over land and streamline revenue collection during the British colonial period. As time progressed, surveys expanded beyond their initial military and geographical purposes, evolving into comprehensive sources of information encompassing geography, political economy, and natural history. The British colonial India corpus is, therefore, intricate, marked by non-standard formatting, and plagued by inconsistencies in the spelling of Indian toponyms. This intricacy adds an extra layer of complexity to the task of extracting and organizing these toponyms for the creation of a historical gazetteer. The recognition of these challenges underscores the importance of using advanced techniques and tools to handle the uncertainty inherent in this historical data.
Digital Humanities methods and tools
The first and foremost challenge is the absence of a training dataset of Indian place names. I need to focus on creating one using Named Entity Recognition (NER) and external open-access resources such as Wikipedia. The second challenge pertains to the advanced programming techniques that I am experimenting with. An initial experiment with BERT NER for identifying toponym entities shows that it performs well compared to other NER libraries. However, it also labeled a few words that are not toponyms as place names and failed to identify broken toponym words as place names. The extracted place name entities will therefore require manual verification to confirm their accuracy. As I am still in the initial stages of my research, I anticipate encountering additional challenges when I begin exploring DeezyMatch.
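The spelling problem described above can be illustrated with a minimal sketch. It uses neither BERT nor DeezyMatch; it swaps in plain character-level similarity from Python's standard library (`difflib.SequenceMatcher`) to show how a variant spelling might be linked to a reference name. The gazetteer entries, the `match_toponym` helper, and the 0.5 threshold are all assumptions made for illustration, not part of the project.

```python
from difflib import SequenceMatcher

# Hypothetical mini-gazetteer of modern reference spellings (illustrative only).
GAZETTEER = ["Madurai", "Kanpur", "Varanasi", "Thanjavur"]


def similarity(a: str, b: str) -> float:
    """Case-insensitive character similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_toponym(candidate, gazetteer=GAZETTEER, threshold=0.5):
    """Return (best_match, score); best_match is None below the threshold.

    The threshold is an assumed value and would need tuning on real data.
    """
    best = max(gazetteer, key=lambda name: similarity(candidate, name))
    score = similarity(candidate, best)
    if score >= threshold:
        return best, score
    return None, score


if __name__ == "__main__":
    # Variant spellings as they might appear in a historical source.
    for variant in ["Madura", "Cawnpore", "Xyzzy"]:
        match, score = match_toponym(variant)
        print(f"{variant!r} -> {match!r} (score {score:.2f})")
```

A close variant such as "Madura" scores high against "Madurai", while a colonial spelling such as "Cawnpore" scores considerably lower against "Kanpur". This hints at why purely character-based similarity may fall short for this corpus, and why each proposed match would still need manual verification.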
Digital Academy workshop on uncertainty
The Digital Academy workshop presented a fantastic opportunity for scholars like myself to convene and discuss a wide array of challenges, approaches, methods, and tools for addressing uncertainty. The inclusion of experts in the field of uncertainty was a valuable aspect of this workshop, enabling attendees to solicit advice and feedback on the challenges they face in their research. Although I was not able to attend the entire workshop, the workshop's theme serves as a motivating factor for me to persist in my research endeavors despite the numerous challenges I've encountered. I believe that ongoing discussions and collaboration within the academic community will be instrumental in finding effective solutions to these challenges and further advancing the field.
Questions remain open
The open questions concern the ideal size of the corpus required for applying the advanced techniques mentioned above and the expected effectiveness of the training dataset. I am hopeful that I will be able to find answers to these questions in the near future.
References
Devlin, Jacob, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” North American Chapter of the Association for Computational Linguistics (2019). Accessed October 5, 2023. https://arxiv.org/pdf/1810.04805v2.
Hosseini, Kasra, Federico Nanni, and Mariona Coll Ardanuy. “DeezyMatch: A Flexible Deep Learning Approach to Fuzzy String Matching.” Paper presented at the Empirical Methods in Natural Language Processing: System Demonstrations, Online, October 2020. Accessed October 5, 2023. https://aclanthology.org/2020.emnlp-demos.9.
Biographical note
Digital Academy 2023: Catrina Langenegger about Swiss Military Refugee Camps
Historical Map of Switzerland.
by Catrina Langenegger
I now come back to the missing reports mentioned above. My goal is to be transparent about this gap. However, making this gap visible in statistics and visualisations is one of the greatest challenges when dealing with uncertainty. Statistics and visualisations are positivistic: they only show what is there. In my first statistics, the gaps were not visible. I therefore created artificial observations in my dataset, with zero as their value, to mark the gaps. In other words, I made the missing weekly reports visible by creating an observation for each of these dates, and I have labelled these artificial observations as such. My data model now provides a field that records whether or not a report exists for a given week. Nevertheless, it is almost impossible to visualise the weeks without information. Although I have made artificial entries in my dataset, they are not displayed in the visualisations because they do not contain a value.
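The idea of labelled placeholder observations can be sketched as follows. This is only an illustration under stated assumptions: the class `WeeklyObservation`, the field `has_report`, the helper functions, the dates, and the values are all hypothetical and do not reproduce the actual data model or software used in the project.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class WeeklyObservation:
    week_start: date
    value: int        # figure taken from the weekly report (invented here)
    has_report: bool  # False marks an artificial placeholder for a gap


def fill_missing_weeks(observations, start, end):
    """Return one observation per week, adding flagged placeholders for gaps."""
    by_week = {obs.week_start: obs for obs in observations}
    filled = []
    current = start
    while current <= end:
        placeholder = WeeklyObservation(current, 0, has_report=False)
        filled.append(by_week.get(current, placeholder))
        current += timedelta(weeks=1)
    return filled


def coverage(observations):
    """Share of weeks that are backed by an actual report."""
    reported = sum(1 for obs in observations if obs.has_report)
    return reported / len(observations)


def mean_of_reported(observations):
    """Average over real reports only, so placeholder zeros do not distort it."""
    values = [obs.value for obs in observations if obs.has_report]
    return sum(values) / len(values)


# Two invented reports with gaps in between and after.
reports = [
    WeeklyObservation(date(1943, 1, 4), 120, True),
    WeeklyObservation(date(1943, 1, 18), 135, True),
]
timeline = fill_missing_weeks(reports, date(1943, 1, 4), date(1943, 1, 25))
```

Keeping the flag separate from the value means a zero is never mistaken for a reported count: summaries can skip the placeholders, while a figure such as `coverage(timeline)` states openly how many weeks are actually documented.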
fig. 1: Timeline with missing data
fig. 2: Auto-corrected timeline
The software I use factors out all uncertain data and displays an average instead. I worked around this by using only the edit mode, even for my visualisations, because in viewing mode the observations I inserted to show the uncertainty are removed. In both examples, I was able to incorporate the uncertainty into the data via a categorisation in my data model. I hope this also makes my data easier to reuse, as it makes transparent statements about its own quality.
Catrina Langenegger recently submitted her PhD thesis on refugee camps under military control in Switzerland during the Second World War. She conducts her research at the Centre for Jewish Studies at the University of Basel. A historian with a focus on digital humanities, she also pursues her passion for data in her role as a subject librarian with a background in library and information sciences.
References:
1. Cf. Karten der Schweiz - Schweizerische Eidgenossenschaft - map.geo.admin.ch: https://map.geo.admin.ch/?topic=swisstopo&lang=de&bgLayer=ch.swisstopo.pixelkarte-farbe&catalogNodes=1392&layers=ch.swisstopo.zeitreihen&time=1864&layers_timestamp=18641231.
Meet ... Jens Zinn
Jens Zinn is T.R. Ashworth Associate Professor in Sociology, Social and Political Sciences, at The University of Melbourne and a CeUS member.
What connects you to Bielefeld University?