Center for Uncertainty Studies Blog

Published on May 2, 2024

Category Digital Academy

Tags: digital history research uncertainty

Digital Academy 2023: Exploring Uncertainty in Toponyms within the British Colonial Corpus

From September 25 to 28, 2023, the Digital History Working Group at Bielefeld University welcomed participants to the Digital Academy, themed "From Uncertainty to Action: Advancing Research with Digital Data." The event delved into the complexities of data-based research, exploring strategies for navigating uncertainties within the Digital Humanities. In a series of blog posts, four attendees of the workshop program share insights into their work on data collection and analysis and reflect on the knowledge gained from the interdisciplinary discussions at the Digital Academy. Learn more about the event by visiting the Digital Academy website.

Exploring Uncertainty in Toponyms within the British Colonial Corpus

by Shanmugapriya T

My research project aims to extract toponyms from the British India colonial corpus to create a historical gazetteer. The primary challenge in this work revolves around the toponyms themselves, as they exhibit a high degree of fuzziness and inconsistency, particularly in their spellings. Historically, mapping, documenting, and surveying have been recognized as essential tools employed by colonial powers to demarcate, expand, and exert control over their colonial subjects. These activities enabled the colonial administration to establish governance over land and streamline revenue collection during the British colonial period. As time progressed, surveys expanded beyond their initial military and geographical purposes, evolving into comprehensive sources of information encompassing geography, political economy, and natural history. The British colonial India corpus is, therefore, intricate, marked by non-standard formatting, and plagued by inconsistencies in the spelling of Indian toponyms. This intricacy adds an extra layer of complexity to the task of extracting and organizing these toponyms for the creation of a historical gazetteer. The recognition of these challenges underscores the importance of using advanced techniques and tools to handle the uncertainty inherent in this historical data.

Digital Humanities methods and tools

Dealing with fuzzy toponyms requires specific and advanced techniques. In this context, I use digital humanities methods and tools to identify and extract these toponyms from the British India colonial corpus. Indian toponyms in the corpus often appear in several spellings, such as "Noil", "Noyal", "Noyyal", "Bawani", "Bhawani", and "Bowani", which are variant spellings of river and place names in the southern region of India. To address this challenge, I first conducted an exploration of the corpus. My approach involved leveraging an English word database, employing regular expressions, using the natural language processing library spaCy for customized entities, and drawing on other relevant Python libraries to extract transliterated words from the corpus. Additionally, I developed a user interface using HTML, CSS, and JavaScript. I used the open-source database MySQL to store the data and PHP to interact with and manage it. Finally, I employed the Geographic Information System (GIS) tool ArcGIS to filter, map, and tag the toponyms and other entities within the dataset. While these initial experiments contributed to theoretical considerations and raised awareness of the complexities inherent in studying the British colonial corpus, the method did not entirely resolve the challenge of extracting toponyms: it also inadvertently picked out misspelled and non-contemporary English words along with the targeted toponyms.
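The word-list step of this first approach, and the limitation it ran into, can be illustrated with a minimal sketch. It assumes that any token absent from an English word list is kept as a candidate transliterated word. The tiny word list, the sample sentence, and the function name are hypothetical stand-ins, not the project's actual data or code.

```python
import re

# Hypothetical, deliberately tiny stand-in for an English word database.
ENGLISH_WORDS = {
    "the", "river", "rises", "in", "hills", "and", "joins", "near", "a", "village",
}

# Only alphabetic runs are treated as tokens in this sketch.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def candidate_transliterations(text, english_words=ENGLISH_WORDS):
    """Return tokens that are absent from the word list, in order, without repeats."""
    candidates = []
    for token in TOKEN_RE.findall(text):
        if token.lower() not in english_words and token not in candidates:
            candidates.append(token)
    return candidates


# Invented sample sentence with two toponyms and one misspelled English word.
sample = "The Noyal river rises in the hills and joins the Bhawani near a villege."
candidates = candidate_transliterations(sample)
print(candidates)
```

The misspelling "villege" ends up among the candidates together with "Noyal" and "Bhawani", because a dictionary lookup alone cannot tell a toponym from any other out-of-vocabulary word.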

The new method I propose involves three distinct stages. The first stage centers on identifying entities with Named Entity Recognition (NER) based on the language model BERT (Devlin et al. 2019) in order to create a training dataset of place names. This NER system is instrumental in locating hidden toponyms and learning from contextual information. The second stage is dedicated to the extraction of fuzzy toponyms, for which I employ the deep learning library DeezyMatch (Hosseini et al. 2020). DeezyMatch is designed for fuzzy string matching and candidate ranking and has been applied to toponym matching. To generate the training dataset of string pairs, I also collect alternate names of places in South India. By learning transformations similar to those present in the training set, DeezyMatch should be able to apply this knowledge to unseen variations of toponyms. Subsequently, I use the cleaned dataset to determine optimal hyperparameters for specific scenarios, such as finding the ideal thresholds for matching. In the final stage, I create a database for the historical gazetteer and integrate it with the World Historical Gazetteer. This integration is significant because the World Historical Gazetteer offers a wide range of content and services that empower global historians, their students, and the general public to engage in spatial and temporal analysis and visualization within a data-rich environment, spanning global and trans-regional scales (“Introducing the World Historical Gazetteer”). This enhances the accessibility and utility of the historical toponym data for a broad audience.
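To make the role of the matching threshold in the second stage concrete, here is a minimal sketch. It does not use DeezyMatch. It substitutes the character-based similarity ratio from Python's standard library `difflib` for a learned model, purely for illustration. The two headwords are spellings taken from the examples above and chosen arbitrarily as reference forms. The function names and the threshold values are assumptions.

```python
from difflib import SequenceMatcher

# Illustrative reference spellings; which variant counts as the headword is an assumption.
GAZETTEER = ["Noyyal", "Bhawani"]


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(variant, headwords=GAZETTEER, threshold=0.7):
    """Return the most similar headword, or None if its score is below the threshold."""
    score, headword = max((similarity(variant, h), h) for h in headwords)
    return headword if score >= threshold else None


for variant in ["Noil", "Noyal", "Bawani", "Bowani"]:
    print(variant, "->", best_match(variant))
```

With the threshold of 0.7 assumed here, "Noyal", "Bawani", and "Bowani" are linked to a headword, while the shorter variant "Noil" falls below the cut-off and stays unmatched. Lowering the threshold to 0.55 recovers it, at the risk of admitting false matches. A trained model such as DeezyMatch is meant to learn such spelling transformations from string pairs rather than rely on raw character overlap.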

Main challenges

The first and foremost challenge is the absence of a training dataset of Indian place names. I need to focus on creating one using Named Entity Recognition and external open-access resources such as Wikipedia. The second challenge pertains to the advanced programming techniques that I am experimenting with. The initial experiment with BERT NER for identifying toponym entities suggests that the model performs well compared to other NER libraries. However, it also labeled a few words that are not toponyms as place names and failed to recognize broken toponym words as place names. Therefore, the extracted place name entities will require manual verification to confirm their accuracy. I anticipate encountering additional challenges when I begin exploring DeezyMatch, as I am currently in the initial stages of my research.
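One simple way to quantify the two kinds of error described here, once a manually verified list exists, is to compute precision and recall. The sketch below uses invented example sets. The function name and all values are hypothetical and do not report results from the project.

```python
def precision_recall(predicted, verified):
    """Compare predicted place names with a manually verified set."""
    predicted, verified = set(predicted), set(verified)
    true_positives = len(predicted & verified)
    # Precision: share of predicted names that are real toponyms.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real toponyms that the model found.
    recall = true_positives / len(verified) if verified else 0.0
    return precision, recall


# Hypothetical model output: two toponyms plus one non-toponym labeled as a place.
predicted = {"Noyal", "Bhawani", "Collector"}
# Hypothetical manually verified toponyms, one of which the model missed.
verified = {"Noyal", "Bhawani", "Bowani"}

precision, recall = precision_recall(predicted, verified)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this invented example, the non-toponym lowers precision and the missed toponym lowers recall, which mirrors the two error types observed in the initial experiment.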

Digital Academy workshop on uncertainty

The Digital Academy workshop presented a fantastic opportunity for scholars like myself to convene and discuss a wide array of challenges, approaches, methods, and tools for addressing uncertainty. The inclusion of experts in the field of uncertainty was a valuable aspect of this workshop, enabling attendees to solicit advice and feedback on the challenges they face in their research. Although I was not able to attend the entire workshop, the workshop's theme serves as a motivating factor for me to persist in my research endeavors despite the numerous challenges I've encountered. I believe that ongoing discussions and collaboration within the academic community will be instrumental in finding effective solutions to these challenges and further advancing the field.

Questions remain open

The open questions revolve around the ideal size of the corpus required for applying the aforementioned advanced techniques and the expected effectiveness of the trained dataset. However, I am hopeful that I will be able to find answers to these questions in the near future.

References

World Historical Gazetteer. “Introducing the World Historical Gazetteer.” Accessed October 10, 2023. https://whgazetteer.org/about/.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Paper presented at the Conference of the North American Chapter of the Association for Computational Linguistics, 2019. Accessed October 5, 2023. https://arxiv.org/pdf/1810.04805v2.

Hosseini, Kasra, Federico Nanni, and Mariona Coll Ardanuy. “DeezyMatch: A Flexible Deep Learning Approach to Fuzzy String Matching.” Paper presented at the Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, October 2020. Accessed October 5, 2023. https://aclanthology.org/2020.emnlp-demos.9.

Biographical note

Shanmugapriya T is an Assistant Professor in the Department of Humanities and Social Sciences at the Indian Institute of Technology (Indian School of Mines) Dhanbad. She was a Digital Humanities Postdoctoral Scholar in the Department of Historical and Cultural Studies (HCS) at the University of Toronto Scarborough. Her expertise centers around the development and application of digital humanities methods and tools for historical and literary research in South Asia, particularly within the realms of colonial and postcolonial studies. She has a specific focus on areas such as text mining, digital mapping, and the creation of digital creative visualizations.

Visit her personal website: https://www.shanmugapriya.com/
