Hansa Meghwani, Amit Agarwal, Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Srikant Panda
Enterprise search systems often struggle to retrieve accurate,
domain-specific information due to semantic mismatches and overlapping
terminologies. These issues can degrade the performance of downstream
applications such as knowledge management, customer support, and
retrieval-augmented generation agents. To address this challenge, we propose a
scalable hard-negative mining framework tailored specifically for
domain-specific enterprise data. Our approach dynamically selects semantically
challenging but contextually irrelevant documents to enhance deployed
re-ranking models.
Our method integrates diverse embedding models, performs dimensionality
reduction, and uniquely selects hard negatives, ensuring computational
efficiency and semantic precision. Evaluation on our proprietary enterprise
corpus (cloud services domain) demonstrates substantial improvements of 15\% in
MRR@3 and 19\% in MRR@10 compared to state-of-the-art baselines and other
negative sampling techniques. Further validation on public domain-specific
datasets (FiQA, Climate Fever, TechQA) confirms our method's generalizability
and readiness for real-world applications.
Authors' comments: Accepted to ACL 2025
Kuicai Dong, Yujing Chang, Shijie Huang, Yasheng Wang, Ruiming Tang, Yong Liu
Document Visual Question Answering (DocVQA) faces dual challenges in
processing lengthy multimodal documents (text, images, tables) and performing
cross-modal reasoning. Current document retrieval-augmented generation (DocRAG)
methods remain limited by their text-centric approaches, frequently missing
critical visual information. The field also lacks robust benchmarks for
assessing multimodal evidence selection and integration. We introduce MMDocRAG,
a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with
multi-page, cross-modal evidence chains. Our framework introduces innovative
metrics for evaluating multimodal quote selection and enables answers that
interleave text with relevant visual elements. Through large-scale experiments
with 60 VLM/LLM models and 14 retrieval systems, we identify persistent
challenges in multimodal evidence retrieval, selection, and integration.Key
findings reveal advanced proprietary LVMs show superior performance than
open-sourced alternatives. Also, they show moderate advantages using multimodal
inputs over text-only inputs, while open-source alternatives show significant
performance degradation. Notably, fine-tuned LLMs achieve substantial
improvements when using detailed image descriptions. MMDocRAG establishes a
rigorous testing ground and provides actionable insights for developing more
robust multimodal DocVQA systems. Our benchmark and code are available at
https://mmdocrag.github.io/MMDocRAG/.
Authors' comments: preprint. code available at
\url{https://mmdocrag.github.io/MMDocRAG/}
Pierre Achkar, Tim Gollub, Martin Potthast
The exponential growth of scientific publications has made it increasingly
difficult for researchers to stay updated and synthesize knowledge effectively.
This paper presents XSum, a modular pipeline for multi-document summarization
(MDS) in the scientific domain using Retrieval-Augmented Generation (RAG). The
pipeline includes two core components: a question-generation module and an
editor module. The question-generation module dynamically generates questions
adapted to the input papers, ensuring the retrieval of relevant and accurate
information. The editor module synthesizes the retrieved content into coherent
and well-structured summaries that adhere to academic standards for proper
citation. Evaluated on the SurveySum dataset, XSum demonstrates strong
performance, achieving considerable improvements in metrics such as CheckEval,
G-Eval and Ref-F1 compared to existing approaches. This work provides a
transparent, adaptable framework for scientific summarization with potential
applications in a wide range of domains. Code available at
https://github.com/webis-de/scolia25-xsum
Authors' comments: Accepted at SCOLIA@ECIR 2025 Workshop
Quentin Macé, António Loison, Manuel Faysse
The ViDoRe Benchmark V1 was approaching saturation with top models exceeding
90% nDCG@5, limiting its ability to discern improvements. ViDoRe Benchmark V2
introduces realistic, challenging retrieval scenarios via blind contextual
querying, long and cross-document queries, and a hybrid synthetic and
human-in-the-loop query generation process. It comprises four diverse,
multilingual datasets and provides clear evaluation instructions. Initial
results demonstrate substantial room for advancement and highlight insights on
model generalization and multilingual capability. This benchmark is designed as
a living resource, inviting community contributions to maintain relevance
through future evaluations.
Authors' comments: Published as a HuggingFace Blog
Siting Li, Xiang Gao, Simon Shaolei Du
While an image is worth more than a thousand words, only a few provide
crucial information for a given task and thus should be focused on. In light of
this, ideal text-to-image (T2I) retrievers should prioritize specific visual
attributes relevant to queries. To evaluate current retrievers on handling
attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with
9,112 queries about diverse attributes of interest. We find that CLIP-like
retrievers, which are widely adopted due to their efficiency and zero-shot
ability, have poor and imbalanced performance, possibly because their image
embeddings focus on global semantics and subjects while leaving out other
details. Notably, we reveal that even recent Multimodal Large Language Model
(MLLM)-based, stronger retrievers with a larger output dimension struggle with
this limitation. Hence, we hypothesize that retrieving with general image
embeddings is suboptimal for performing such queries. As a solution, we propose
to use promptable image embeddings enabled by these multimodal retrievers,
which boost performance by highlighting required attributes. Our pipeline for
deriving such embeddings generalizes across query types, image pools, and base
retriever architectures. To enhance real-world applicability, we offer two
acceleration strategies: Pre-processing promptable embeddings and using linear
approximations. We show that the former yields a 15% improvement in Recall@5
when prompts are predefined, while the latter achieves an 8% improvement when
prompts are only available during inference.
Authors' comments: 25 pages, 5 figures
Nikolaos Chaidos, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Stamou
Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs driven from image captions. However, the inconsistency of caption-based supervision stemming from variable text encodings undermine retrieval reliability. To address these, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing in advancing state-of-the-art in counterfactual image retrieval.
Marc Allain, Selin Aslan, Wim Coene, Sjoerd Dirksen, Jonathan Dong, Julien Flamant, Mark Iwen, Felix Krahmer et al.
Phase retrieval is an inverse problem that, on one hand, is crucial in many applications across imaging and physics, and, on the other hand, leads to deep research questions in theoretical signal processing and applied harmonic analysis. This survey paper is an outcome of the recent workshop Phase Retrieval in Mathematics and Applications (PRiMA) (held on August 5--9 2024 at the Lorentz Center in Leiden, The Netherlands) that brought together experts working on theoretical and practical aspects of the phase retrieval problem with the purpose to formulate and explore essential open problems in the field.
Adarsh Singh, Kushal Raj Bhandari, Jianxi Gao, Soham Dan, Vivek Gupta
Table Question Answering (TQA) involves retrieving relevant tables from a large corpus to answer natural language queries. Traditional dense retrieval models, such as DTR and ColBERT, not only incur high computational costs for large-scale retrieval tasks but also require retraining or fine-tuning on new datasets, limiting their adaptability to evolving domains and knowledge. In this work, we propose $\textbf{CRAFT}$, a cascaded retrieval approach that first uses a sparse retrieval model to filter a subset of candidate tables before applying more computationally expensive dense models and neural re-rankers. Our approach achieves better retrieval performance than state-of-the-art (SOTA) sparse, dense, and hybrid retrievers. We further enhance table representations by generating table descriptions and titles using Gemini Flash 1.5. End-to-end TQA results using various Large Language Models (LLMs) on NQ-Tables, a subset of the Natural Questions Dataset, demonstrate $\textbf{CRAFT}$ effectiveness.
Yunjia Xi, Jianghao Lin, Menghui Zhu, Yongzhao Xiao, Zhuoying Ou, Jiaqi Liu, Tong Wan, Bo Chen et al.
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding responses with retrieved information. As an emerging paradigm, Agentic RAG further enhances this process by introducing autonomous LLM agents into the information seeking process. However, existing benchmarks fall short in evaluating such systems, as they are confined to a static retrieval environment with a fixed, limited corpus} and simple queries that fail to elicit agentic behavior. Moreover, their evaluation protocols assess information seeking effectiveness by pre-defined gold sets of documents, making them unsuitable for the open-ended and dynamic nature of real-world web environments. To bridge this gap, we present InfoDeepSeek, a new benchmark with challenging questions designed for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity. Based on this, we develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes. Through extensive experiments across LLMs, search engines, and question types, InfoDeepSeek reveals nuanced agent behaviors and offers actionable insights for future research.
Kai Yin, Xiangjue Dong, Chengkai Liu, Lipai Huang, Yiming Xiao, Zhewei Liu, Ali Mostafavi, James Caverlee
Effective disaster management requires timely access to accurate and contextually relevant information. Existing Information Retrieval (IR) benchmarks, however, focus primarily on general or specialized domains, such as medicine or finance, neglecting the unique linguistic complexity and diverse information needs encountered in disaster management scenarios. To bridge this gap, we introduce DisastIR, the first comprehensive IR evaluation benchmark specifically tailored for disaster management. DisastIR comprises 9,600 diverse user queries and more than 1.3 million labeled query-passage pairs, covering 48 distinct retrieval tasks derived from six search intents and eight general disaster categories that include 301 specific event types. Our evaluations of 30 state-of-the-art retrieval models demonstrate significant performance variances across tasks, with no single model excelling universally. Furthermore, comparative analyses reveal significant performance gaps between general-domain and disaster management-specific tasks, highlighting the necessity of disaster management-specific benchmarks for guiding IR model selection to support effective decision-making in disaster management scenarios. All source codes and DisastIR are available at https://github.com/KaiYin97/Disaster_IR.
Lei Li, Xiao Zhou, Zheng Liu
Current medical retrieval benchmarks primarily emphasize lexical or shallow
semantic similarity, overlooking the reasoning-intensive demands that are
central to clinical decision-making. In practice, physicians often retrieve
authoritative medical evidence to support diagnostic hypotheses. Such evidence
typically aligns with an inferred diagnosis rather than the surface form of a
patient's symptoms, leading to low lexical or semantic overlap between queries
and relevant documents. To address this gap, we introduce R2MED, the first
benchmark explicitly designed for reasoning-driven medical retrieval. It
comprises 876 queries spanning three tasks: Q&A reference retrieval, clinical
evidence retrieval, and clinical case retrieval. These tasks are drawn from
five representative medical scenarios and twelve body systems, capturing the
complexity and diversity of real-world medical information needs. We evaluate
15 widely-used retrieval systems on R2MED and find that even the best model
achieves only 31.4 nDCG@10, demonstrating the benchmark's difficulty. Classical
re-ranking and generation-augmented retrieval methods offer only modest
improvements. Although large reasoning models improve performance via
intermediate inference generation, the best results still peak at 41.4 nDCG@10.
These findings underscore a substantial gap between current retrieval
techniques and the reasoning demands of real clinical tasks. We release R2MED
as a challenging benchmark to foster the development of next-generation medical
retrieval systems with enhanced reasoning capabilities. Data and code are
available at https://github.com/R2MED/R2MED
Authors' comments: 38 pages, 16 figures
Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang et al.
As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 Wikipedia documents curated and ranked by human annotators. With RAVENEA, we train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the value of retrieval-augmented methods and culturally inclusive benchmarks for multimodal understanding.
Yoorhim Cho, Hongyeob Kim, Semin Kim, Youjia Zhang, Yunseok Choi, Sungeun Hong
Visuo-tactile perception aims to understand an object's tactile properties, such as texture, softness, and rigidity. However, the field remains underexplored because collecting tactile data is costly and labor-intensive. We observe that visually distinct objects can exhibit similar surface textures or material properties. For example, a leather sofa and a leather jacket have different appearances but share similar tactile properties. This implies that tactile understanding can be guided by material cues in visual data, even without direct tactile supervision. In this paper, we introduce RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. We carefully recaption a large-scale visual dataset with tactile-focused descriptions, enabling the model to access tactile semantics typically absent from conventional visual datasets. A key challenge remains in effectively utilizing these tactile-aware external descriptions. RA-Touch addresses this by retrieving visual-textual representations aligned with tactile inputs and integrating them to focus on relevant textural and material properties. By outperforming prior methods on the TVL benchmark, our method demonstrates the potential of retrieval-based visual reuse for tactile understanding. Code is available at https://aim-skku.github.io/RA-Touch
Ziyang Zeng, Dun Zhang, Jiacheng Li, Panxiang Zou, Yuqing Yang
This study investigates a specific form of positional bias, termed the Myopic
Trap, where retrieval models disproportionately attend to the early parts of
documents while overlooking relevant information that appears later. To
systematically quantify this phenomenon, we propose a semantics-preserving
evaluation framework that repurposes the existing NLP datasets into
position-aware retrieval benchmarks. By evaluating the SOTA models of full
retrieval pipeline, including BM25, embedding models, ColBERT-style
late-interaction models, and reranker models, we offer a broader empirical
perspective on positional bias than prior work. Experimental results show that
embedding models and ColBERT-style models exhibit significant performance
degradation when query-related content is shifted toward later positions,
indicating a pronounced head bias. Notably, under the same training
configuration, ColBERT-style approach show greater potential for mitigating
positional bias compared to the traditional single-vector approach. In
contrast, BM25 and reranker models remain largely unaffected by such
perturbations, underscoring their robustness to positional bias. Code and data
are publicly available at: www.github.com/NovaSearch-Team/RAG-Retrieval.
Authors' comments: 10 pages, 3 figures, 4 tables. Under review
Qifeng Cai, Hao Liang, Hejun Dong, Meiyi Qiang, Ruichuan An, Zhaoyang Han, Zhengzhou Zhu, Bin Cui et al.
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at https://github.com/TechNomad-ds/LoVR-benchmark
Junyu Luo, Yusheng Zhao, Xiao Luo, Zhiping Xiao, Wei Ju, Li Shen, Dacheng Tao, Ming Zhang
Unsupervised efficient domain adaptive retrieval aims to transfer knowledge
from a labeled source domain to an unlabeled target domain, while maintaining
low storage cost and high retrieval efficiency. However, existing methods
typically fail to address potential noise in the target domain, and directly
align high-level features across domains, thus resulting in suboptimal
retrieval performance. To address these challenges, we propose a novel
Cross-Domain Diffusion with Progressive Alignment method (COUPLE). This
approach revisits unsupervised efficient domain adaptive retrieval from a graph
diffusion perspective, simulating cross-domain adaptation dynamics to achieve a
stable target domain adaptation process. First, we construct a cross-domain
relationship graph and leverage noise-robust graph flow diffusion to simulate
the transfer dynamics from the source domain to the target domain, identifying
lower noise clusters. We then leverage the graph diffusion results for
discriminative hash code learning, effectively learning from the target domain
while reducing the negative impact of noise. Furthermore, we employ a
hierarchical Mixup operation for progressive domain alignment, which is
performed along the cross-domain random walk paths. Utilizing target domain
discriminative hash learning and progressive domain alignment, COUPLE enables
effective domain adaptive hash learning. Extensive experiments demonstrate
COUPLE's effectiveness on competitive benchmarks.
Authors' comments: IEEE TIP
Yuning Jiang, Feiyang Shang, Freedy Tan Wei You, Huilin Wang, Chia Ren Cong, Qiaoran Meng, Nay Oo, Hoon Wei Lim et al.
The dynamic landscape of cybersecurity demands precise and scalable solutions for vulnerability management in heterogeneous systems, where configuration-specific vulnerabilities are often misidentified due to inconsistent data in databases like the National Vulnerability Database (NVD). Inaccurate Common Platform Enumeration (CPE) data in NVD further leads to false positives and incomplete vulnerability retrieval. Informed by our systematic analysis of CPE and CVEdeails data, revealing more than 50% vendor name inconsistencies, we propose VulCPE, a framework that standardizes data and models configuration dependencies using a unified CPE schema (uCPE), entity recognition, relation extraction, and graph-based modeling. VulCPE achieves superior retrieval precision (0.766) and coverage (0.926) over existing tools. VulCPE ensures precise, context-aware vulnerability management, enhancing cyber resilience.
Runchu Tian, Xueqiang Xu, Bowen Jin, SeongKu Kang, Jiawei Han
Scientific retrieval is essential for advancing academic discovery. Within
this process, document reranking plays a critical role by refining first-stage
retrieval results. However, large language model (LLM) listwise reranking faces
unique challenges in the scientific domain. First-stage retrieval is often
suboptimal in the scientific domain, so relevant documents are ranked lower.
Moreover, conventional listwise reranking uses the full text of candidate
documents in the context window, limiting the number of candidates that can be
considered. As a result, many relevant documents are excluded before reranking,
which constrains overall retrieval performance. To address these challenges, we
explore compact document representations based on semantic features such as
categories, sections, and keywords, and propose a training-free, model-agnostic
reranking framework for scientific retrieval called CoRank. The framework
involves three stages: (i) offline extraction of document-level features, (ii)
coarse reranking using these compact representations, and (iii) fine-grained
reranking on full texts of the top candidates from stage (ii). This hybrid
design provides a high-level abstraction of document semantics, expands
candidate coverage, and retains critical details required for precise ranking.
Experiments on LitSearch and CSFCube show that CoRank significantly improves
reranking performance across different LLM backbones, increasing nDCG@10 from
32.0 to 39.7. Overall, these results highlight the value of information
extraction for reranking in scientific retrieval.
Authors' comments: 17 pages, 4 figures
Tommaso Mario Buonocore, Enea Parimbelli
Content moderation for large language models (LLMs) remains a significant
challenge, requiring flexible and adaptable solutions that can quickly respond
to emerging threats. This paper introduces Retrieval Augmented Rejection (RAR),
a novel approach that leverages a retrieval-augmented generation (RAG)
architecture to dynamically reject unsafe user queries without model
retraining. By strategically inserting and marking malicious documents into the
vector database, the system can identify and reject harmful requests when these
documents are retrieved. Our preliminary results show that RAR achieves
comparable performance to embedded moderation in LLMs like Claude 3.5 Sonnet,
while offering superior flexibility and real-time customization capabilities, a
fundamental feature to timely address critical vulnerabilities. This approach
introduces no architectural changes to existing RAG systems, requiring only the
addition of specially crafted documents and a simple rejection mechanism based
on retrieval results.
Authors' comments: 7 pages, 4 figures, 2 tables
Kevin Chenhao Li, Vahid Zolfaghari, Nenad Petrovic, Fengjunjie Pan, Alois Knoll
The Object Constraint Language (OCL) is essential for defining precise constraints within Model-Based Systems Engineering (MBSE). However, manually writing OCL rules is complex and time-consuming. This study explores the optimization of Retrieval-Augmented Generation (RAG) for automating OCL rule generation, focusing on the impact of different retrieval strategies. We evaluate three retrieval approaches $\unicode{x2013}$ BM25 (lexical-based), BERT-based (semantic retrieval), and SPLADE (sparse-vector retrieval) $\unicode{x2013}$ analyzing their effectiveness in providing relevant context for a large language model. To further assess our approach, we compare and benchmark our retrieval-optimized generation results against PathOCL, a state-of-the-art graph-based method. We directly compare BM25, BERT, and SPLADE retrieval methods with PathOCL to understand how different retrieval methods perform for a unified evaluation framework. Our experimental results, focusing on retrieval-augmented generation, indicate that while retrieval can enhance generation accuracy, its effectiveness depends on the retrieval method and the number of retrieved chunks (k). BM25 underperforms the baseline, whereas semantic approaches (BERT and SPLADE) achieve better results, with SPLADE performing best at lower k values. However, excessive retrieval with high k parameter can lead to retrieving irrelevant chunks which degrades model performance. Our findings highlight the importance of optimizing retrieval configurations to balance context relevance and output consistency. This research provides insights into improving OCL rule generation using RAG and underscores the need for tailoring retrieval.