Towards Inclusive Fact-Checking: Claim Verification in English, Hindi, Bengali, and Code-Mixed Languages
Abstract
Automated claim verification has received significant attention in recent years due to the widespread dissemination of misinformation across digital platforms. While substantial progress has been made for high-resource languages such as English, claim verification for low-resource languages, and for Code-Mixed text in particular, remains largely unexplored, even in a multilingual country like India. In this work, we introduce a novel multilingual dataset for claim verification covering English, Hindi, Bengali, and Hindi-English Code-Mixed text. The dataset is developed by engaging both large language models (LLMs) and human annotators, and contains claims, evidence passages, and veracity labels (\textit{SUPPORTS} or \textit{REFUTES}) for news headlines collected from three important domains: Politics, Healthcare, and Law and Order. We propose a rule-based baseline algorithm and a transformer-based dual-encoder framework to verify claims effectively across these diverse languages. Our results show that XLM-RoBERTa achieves the best performance for English and Code-Mixed text, while IndicBERTv2 performs best for Hindi and Bengali. This study highlights the challenges and opportunities in multilingual and Code-Mixed claim verification, offering a step towards inclusive, language-diverse fact-checking systems even in low-resource settings.
Keywords
Claim verification, fact-checking, low-resource languages, prompt engineering