Entropy–Distance Approach to Evaluating Diversity and Robustness in Organizational Information Retrieval
Abstract
Information retrieval constitutes a critical component of organizational information management, directly affecting the efficiency, accuracy, and resilience of decision-making processes. Conventional evaluation metrics—such as precision or click-through rates—do not adequately capture the lexical and semantic diversity of retrieved content, limiting their utility in managerial contexts where both relevance and variety are essential. This study introduces a scalable, language-agnostic entropy–distance framework designed to assess the robustness of retrieval systems under controlled linguistic variation. The framework integrates Shannon entropy, which quantifies lexical diversity, with semantic dispersion measures derived from SBERT embeddings, enabling joint evaluation of breadth and coherence in search outputs. Using a curated 6.6M-article Wikipedia corpus, topics were clustered, summarized, and reformulated into paraphrased queries, which were executed across Google, Bing, and DuckDuckGo. The results reveal significant differences in diversity–coherence trade-offs across the three engines, with DuckDuckGo exhibiting the highest adaptability to query variation. The proposed methodology supports information governance by providing an unsupervised, reproducible metric that enables comparative auditing of search performance in enterprise and public domains. The findings offer actionable insights for optimizing retrieval strategies, mitigating systemic bias, and enhancing the resilience of organizational search infrastructures.
Keywords
entropy; search engine evaluation; paraphrase robustness; semantic dispersion; information retrieval; query variability
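The two measures combined by the framework—Shannon entropy over retrieved tokens and semantic dispersion over result embeddings—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are ours, and the toy vectors in the test stand in for the SBERT embeddings the study actually computes from search outputs.

```python
import math
from collections import Counter
from itertools import combinations

def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution,
    used as a proxy for lexical diversity of retrieved content."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_dispersion(embeddings):
    """Mean pairwise cosine distance between result embeddings.
    Higher values indicate semantically more spread-out results;
    0.0 means all results are semantically identical."""
    pairs = list(combinations(embeddings, 2))
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```

In a full pipeline, `shannon_entropy` would be applied to the tokenized snippets returned for a query and `semantic_dispersion` to their SBERT embeddings, yielding the breadth and coherence axes evaluated per engine and per paraphrase.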