Xiaosheng Long, Hanyu Wang, Zhentao Song, Kun Luo, Hongde Liu
Recent retrieval-augmented image captioning methods incorporate external knowledge to compensate for the limitations in comprehending complex scenes. However, current approaches face challenges in relation modeling: (1) the representation of semantic prompts is too coarse-grained to capture fine-grained relationships; (2) these methods lack explicit modeling of image objects and their semantic relationships. To address these limitations, we propose RACap, a relation-aware retrieval-augmented model for image captioning, which not only mines structured relation semantics from retrieval captions, but also identifies heterogeneous objects from the image. RACap effectively retrieves structured relation features that contain heterogeneous visual information to enhance the semantic consistency and relational expressiveness. Experimental results show that RACap, with only 10.8M trainable parameters, achieves superior performance compared to previous lightweight captioning models.
Liwei Liao, Xufeng Li, Xiaoyun Zheng, Boning Liu, Feng Gao, Ronggang Wang
3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods encounter two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require larges amounts of labeled data for effective training. To this end, we propose \underline{G}rounding via \underline{V}iew \underline{R}etrieval (GVR), a novel zero-shot visual grounding framework for 3DGS to transform 3DVG as a 2D retrieval task that leverages object-level view retrieval to collect grounding clues from multiple views, which not only avoids the costly process of 3D annotation, but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found in https://github.com/leviome/GVR_demos.
Yufeng Li, Arkaitz Zubiaga
The rapid spread of misinformation on social media underscores the need for
scalable fact-checking tools. A key step is claim detection, which identifies
statements that can be objectively verified. Prior approaches often rely on
linguistic cues or claim check-worthiness, but these struggle with vague
political discourse and diverse formats such as tweets. We present RAVE
(Retrieval and Scoring Aware Verifiable Claim Detection), a framework that
combines evidence retrieval with structured signals of relevance and source
credibility. Experiments on CT22-test and PoliClaim-test show that RAVE
consistently outperforms text-only and retrieval-based baselines in both
accuracy and F1.
Authors' comments: 5 pages, 1 figure
Chong You, Rajesh Jayaram, Ananda Theertha Suresh, Robin Nittka, Felix Yu, Sanjiv Kumar
Dual encoder (DE) models, where a pair of matching query and document are embedded into similar vector representations, are widely used in information retrieval due to their simplicity and scalability. However, the Euclidean geometry of the embedding space limits the expressive power of DEs, which may compromise their quality. This paper investigates such limitations in the context of hierarchical retrieval (HR), where the document set has a hierarchical structure and the matching documents for a query are all of its ancestors. We first prove that DEs are feasible for HR as long as the embedding dimension is linear in the depth of the hierarchy and logarithmic in the number of documents. Then we study the problem of learning such embeddings in a standard retrieval setup where DEs are trained on samples of matching query and document pairs. Our experiments reveal a lost-in-the-long-distance phenomenon, where retrieval accuracy degrades for documents further away in the hierarchy. To address this, we introduce a pretrain-finetune recipe that significantly improves long-distance retrieval without sacrificing performance on closer documents. We experiment on a realistic hierarchy from WordNet for retrieving documents at various levels of abstraction, and show that pretrain-finetune boosts the recall on long-distance pairs from 19% to 76%. Finally, we demonstrate that our method improves retrieval of relevant products on a shopping queries dataset.
Authors' comments: NeurIPS 2025
Ruohan Zhang, Jiacheng Li, Julian McAuley, Yupeng Hou
Semantic identifiers (IDs) have proven effective in adapting large language models for generative recommendation and retrieval. However, existing methods often suffer from semantic ID conflicts, where semantically similar documents (or items) are assigned identical IDs. A common strategy to avoid conflicts is to append a non-semantic token to distinguish them, which introduces randomness and expands the search space, therefore hurting performance. In this paper, we propose purely semantic indexing to generate unique, semantic-preserving IDs without appending non-semantic tokens. We enable unique ID assignment by relaxing the strict nearest-centroid selection and introduce two model-agnostic algorithms: exhaustive candidate matching (ECM) and recursive residual searching (RRS). Extensive experiments on sequential recommendation, product search, and document retrieval tasks demonstrate that our methods improve both overall and cold-start performance, highlighting the effectiveness of ensuring ID uniqueness.
Saket S. Chaturvedi, Gaurav Bagwe, Lan Zhang, Xiaoyong Yuan
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by
retrieving relevant documents from external sources to improve factual accuracy
and verifiability. However, this reliance introduces new attack surfaces within
the retrieval pipeline, beyond the LLM itself. While prior RAG attacks have
exposed such vulnerabilities, they largely rely on manipulating user queries,
which is often infeasible in practice due to fixed or protected user inputs.
This narrow focus overlooks a more realistic and stealthy vector: instructional
prompts, which are widely reused, publicly shared, and rarely audited. Their
implicit trust makes them a compelling target for adversaries to manipulate RAG
behavior covertly.
We introduce a novel attack for Adversarial Instructional Prompt (AIP) that
exploits adversarial instructional prompts to manipulate RAG outputs by subtly
altering retrieval behavior. By shifting the attack surface to the
instructional prompts, AIP reveals how trusted yet seemingly benign interface
components can be weaponized to degrade system integrity. The attack is crafted
to achieve three goals: (1) naturalness, to evade user detection; (2) utility,
to encourage use of prompts; and (3) robustness, to remain effective across
diverse query variations. We propose a diverse query generation strategy that
simulates realistic linguistic variation in user queries, enabling the
discovery of prompts that generalize across paraphrases and rephrasings.
Building on this, a genetic algorithm-based joint optimization is developed to
evolve adversarial prompts by balancing attack success, clean-task utility, and
stealthiness. Experimental results show that AIP achieves up to 95.23% ASR
while preserving benign functionality. These findings uncover a critical and
previously overlooked vulnerability in RAG systems, emphasizing the need to
reassess the shared instructional prompts.
Authors' comments: Accepted at EMNLP 2025 Conference
Arda Kabadayi, Senem Velipasalar, Jiajing Chen
Compared to traditional image retrieval tasks, product retrieval in retail settings is even more challenging. Products of the same type from different brands may have highly similar visual appearances, and the query image may be taken from an angle that differs significantly from view angles of the stored catalog images. Foundational models, such as CLIP and SigLIP, often struggle to distinguish these subtle but important local differences. Pixel-wise matching methods, on the other hand, are computationally expensive and incur prohibitively high matching times. In this paper, we propose a new, hybrid method, called PRISM, for product retrieval in retail settings by leveraging the advantages of both vision-language model-based and pixel-wise matching approaches. To provide both efficiency/speed and finegrained retrieval accuracy, PRISM consists of three stages: 1) A vision-language model (SigLIP) is employed first to retrieve the top 35 most semantically similar products from a fixed gallery, thereby narrowing the search space significantly; 2) a segmentation model (YOLO-E) is applied to eliminate background clutter; 3) fine-grained pixel-level matching is performed using LightGlue across the filtered candidates. This framework enables more accurate discrimination between products with high inter-class similarity by focusing on subtle visual cues often missed by global models. Experiments performed on the ABV dataset show that our proposed PRISM outperforms the state-of-the-art image retrieval methods by 4.21% in top-1 accuracy while still remaining within the bounds of real-time processing for practical retail deployments.
Longfei Zuo, Pingjun Hong, Oliver Kraus, Barbara Plank, Robert Litschko
Multi-stage information retrieval (IR) has become a widely-adopted paradigm
in search. While Large Language Models (LLMs) have been extensively evaluated
as second-stage reranking models for monolingual IR, a systematic large-scale
comparison is still lacking for cross-lingual IR (CLIR). Moreover, while prior
work shows that LLM-based rerankers improve CLIR performance, their evaluation
setup relies on lexical retrieval with machine translation (MT) for the first
stage. This is not only prohibitively expensive but also prone to error
propagation across stages. Our evaluation on passage-level and document-level
CLIR reveals that further gains can be achieved with multilingual bi-encoders
as first-stage retrievers and that the benefits of translation diminishes with
stronger reranking models. We further show that pairwise rerankers based on
instruction-tuned LLMs perform competitively with listwise rerankers. To the
best of our knowledge, we are the first to study the interaction between
retrievers and rerankers in two-stage CLIR with LLMs. Our findings reveal that,
without MT, current state-of-the-art rerankers fall severely short when
directly applied in CLIR.
Authors' comments: Accepted at EMNLP 2025 (Findings)
Shangrong Wu, Yanghong Zhou, Yang Chen, Feng Zhang, P. Y. Mok
Image retrieval remains a fundamental yet challenging problem in computer vision. While recent advances in Multimodal Large Language Models (MLLMs) have demonstrated strong reasoning capabilities, existing methods typically employ them only for evaluation, without involving them directly in the ranking process. As a result, their rich multimodal reasoning abilities remain underutilized, leading to suboptimal performance. In this paper, we propose a novel Chain-of-Thought Re-Ranking (CoTRR) method to address this issue. Specifically, we design a listwise ranking prompt that enables MLLM to directly participate in re-ranking candidate images. This ranking process is grounded in an image evaluation prompt, which assesses how well each candidate aligns with users query. By allowing MLLM to perform listwise reasoning, our method supports global comparison, consistent reasoning, and interpretable decision-making - all of which are essential for accurate image retrieval. To enable structured and fine-grained analysis, we further introduce a query deconstruction prompt, which breaks down the original query into multiple semantic components. Extensive experiments on five datasets demonstrate the effectiveness of our CoTRR method, which achieves state-of-the-art performance across three image retrieval tasks, including text-to-image retrieval (TIR), composed image retrieval (CIR) and chat-based image retrieval (Chat-IR). Our code is available at https://github.com/freshfish15/CoTRR .
Suyuchen Wang, Jinlin Wang, Xinyu Wang, Shiqi Li, Xiangru Tang, Sirui Hong, Xiao-Wen Chang, Chenglin Wu et al.
Large language models (LLMs) often struggle with context fidelity, producing
inconsistent answers when responding to questions based on provided
information. Existing approaches either rely on expensive supervised
fine-tuning to generate evidence post-answer or train models to perform web
searches without necessarily improving utilization of the given context. We
propose CARE, a novel native retrieval-augmented reasoning framework that
teaches LLMs to explicitly integrate in-context evidence within their reasoning
process with the model's own retrieval capabilities. Our method requires
limited labeled evidence data while significantly enhancing both retrieval
accuracy and answer generation performance through strategically retrieved
in-context tokens in the reasoning chain. Extensive experiments on multiple
real-world and counterfactual QA benchmarks demonstrate that our approach
substantially outperforms supervised fine-tuning, traditional
retrieval-augmented generation methods, and external retrieval solutions. This
work represents a fundamental advancement in making LLMs more accurate,
reliable, and efficient for knowledge-intensive tasks.
Authors' comments: Accepted as a main conference paper at EMNLP 2025
Yifan Liu, Qianfeng Wen, Mark Zhao, Jiazhou Liang, Scott Sanner
Dense Passage Retrieval (DPR) typically relies on Euclidean or cosine
distance to measure query-passage relevance in embedding space, which is
effective when embeddings lie on a linear manifold. However, our experiments
across DPR benchmarks suggest that embeddings often lie on lower-dimensional,
non-linear manifolds, especially in out-of-distribution (OOD) settings, where
cosine and Euclidean distance fail to capture semantic similarity. To address
this limitation, we propose a manifold-aware distance metric for DPR (MA-DPR)
that models the intrinsic manifold structure of passages using a nearest
neighbor graph and measures query-passage distance based on their shortest path
in this graph. We show that MA-DPR outperforms Euclidean and cosine distances
by up to 26% on OOD passage retrieval with comparable in-distribution
performance across various embedding models while incurring a minimal increase
in query inference time. Empirical evidence suggests that manifold-aware
distance allows DPR to leverage context from related neighboring passages,
making it effective even in the absence of direct semantic overlap. MADPR can
be applied to a wide range of dense embedding and retrieval tasks, offering
potential benefits across a wide spectrum of domains.
Authors' comments: 19 pages, 8 figures
Gwendal Le Vaillant, Yannick Molle
Efficiently retrieving specific instrument timbres from audio mixtures remains a challenge in digital music production. This paper introduces a contrastive learning framework for musical instrument retrieval, enabling direct querying of instrument databases using a single model for both single- and multi-instrument sounds. We propose techniques to generate realistic positive/negative pairs of sounds for virtual musical instruments, such as samplers and synthesizers, addressing limitations in common audio data augmentation methods. The first experiment focuses on instrument retrieval from a dataset of 3,884 instruments, using single-instrument audio as input. Contrastive approaches are competitive with previous works based on classification pre-training. The second experiment considers multi-instrument retrieval with a mixture of instruments as audio input. In this case, the proposed contrastive framework outperforms related works, achieving 81.7\% top-1 and 95.7\% top-5 accuracies for three-instrument mixtures.
Taher Yacoub, Camille Depenveiller, Atsushi Tatsuma, Tin Barisin, Eugen Rusakov, Udo Gobel, Yuxu Peng, Shiqiang Deng et al.
This SHREC 2025 track dedicated to protein surface shape retrieval involved 9
participating teams. We evaluated the performance in retrieval of 15 proposed
methods on a large dataset of 11,555 protein surfaces with calculated
electrostatic potential (a key molecular surface descriptor). The performance
in retrieval of the proposed methods was evaluated through different metrics
(Accuracy, Balanced accuracy, F1 score, Precision and Recall). The best
retrieval performance was achieved by the proposed methods that used the
electrostatic potential complementary to molecular surface shape. This
observation was also valid for classes with limited data which highlights the
importance of taking into account additional molecular surface descriptors.
Authors' comments: Published in Computers & Graphics, Elsevier. 59 pages, 12 figures
Hanqing Li, Kiran Sheena Jyothi, Henry Liang, Sharika Mahadevan, Diego Klabjan
We propose a new, training-free method, Graph Reasoning via Retrieval Augmented Framework (GRRAF), that harnesses retrieval-augmented generation (RAG) alongside the code-generation capabilities of large language models (LLMs) to address a wide range of graph reasoning tasks. In GRRAF, the target graph is stored in a graph database, and the LLM is prompted to generate executable code queries that retrieve the necessary information. This approach circumvents the limitations of existing methods that require extensive finetuning or depend on predefined algorithms, and it incorporates an error feedback loop with a time-out mechanism to ensure both correctness and efficiency. Experimental evaluations on the GraphInstruct dataset reveal that GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle detection, bipartite graph checks, shortest path computation, and maximum flow, while maintaining consistent token costs regardless of graph sizes. Imperfect but still very high performance is observed on subgraph matching. Notably, GRRAF scales effectively to large graphs with up to 10,000 nodes.
Zequn Xie, Chuxin Wang, Sihang Cai, Yeqiang Wang, Shulei Wang, Tao Jin
Text-based person search (TBPS) enables the retrieval of person images from
large-scale databases using natural language descriptions, offering critical
value in surveillance applications. However, a major challenge lies in the
labor-intensive process of obtaining high-quality textual annotations, which
limits scalability and practical deployment. To address this, we introduce two
complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text
Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues
with MLLMs, producing fine-grained and diverse visual descriptions without
manual supervision. MTI refines user queries at inference time through dynamic,
dialogue-based reasoning, enabling the system to interpret and resolve vague,
incomplete, or ambiguous descriptions - characteristics often seen in
real-world search scenarios. Together, MTG and MTI form a unified and
annotation-free framework that significantly improves retrieval accuracy,
robustness, and usability. Extensive evaluations demonstrate that our method
achieves competitive or superior results while eliminating the need for manual
captions, paving the way for scalable and practical deployment of TBPS systems.
Authors' comments: Accepted by EMNLP 2025. 13 pages, 3 figures
Wonbin Kweon, SeongKu Kang, Runchu Tian, Pengcheng Jiang, Jiawei Han, Hwanjo Yu
The effectiveness of in-context learning relies heavily on selecting
demonstrations that provide all the necessary information for a given test
input. To achieve this, it is crucial to identify and cover fine-grained
knowledge requirements. However, prior methods often retrieve demonstrations
based solely on embedding similarity or generation probability, resulting in
irrelevant or redundant examples. In this paper, we propose TopicK, a topic
coverage-based retrieval framework that selects demonstrations to
comprehensively cover topic-level knowledge relevant to both the test input and
the model. Specifically, TopicK estimates the topics required by the input and
assesses the model's knowledge on those topics. TopicK then iteratively selects
demonstrations that introduce previously uncovered required topics, in which
the model exhibits low topical knowledge. We validate the effectiveness of
TopicK through extensive experiments across various datasets and both open- and
closed-source LLMs. Our source code is available at
https://github.com/WonbinKweon/TopicK_EMNLP2025.
Authors' comments: EMNLP 2025 Main
Ilya Tyagin, Saeideh Valipour, Aliaksandra Sikirzhytskaya, Michael Shtutman, Ilya Safro
We introduce an explainability method for biomedical hypothesis generation
systems, built on top of the novel Hypothesis Generation Context Retriever
framework. Our approach combines semantic graph-based retrieval and relevant
data-restrictive training to simulate real-world discovery constraints.
Integrated with large language models (LLMs) via retrieval-augmented
generation, the system explains hypotheses with contextual evidence using
published scientific literature. We also propose a novel feedback loop
approach, which iteratively identifies and corrects flawed parts of
LLM-generated explanations, refining both the evidence paths and supporting
context. We demonstrate the performance of our method with multiple large
language models and evaluate the explanation and context retrieval quality
through both expert-curated assessment and large-scale automated analysis. Our
code is available at: https://github.com/IlyaTyagin/HGCR.
Authors' comments: 30 pages, 10 figures,
Wensheng Lu, Keyu Chen, Ruizhi Qiao, Xing Sun
Retrieval-Augmented Generation (RAG) enhances the response capabilities of
language models by integrating external knowledge sources. However, document
chunking as an important part of RAG system often lacks effective evaluation
tools. This paper first analyzes why existing RAG evaluation benchmarks are
inadequate for assessing document chunking quality, specifically due to
evidence sparsity. Based on this conclusion, we propose HiCBench, which
includes manually annotated multi-level document chunking points, synthesized
evidence-dense quetion answer(QA) pairs, and their corresponding evidence
sources. Additionally, we introduce the HiChunk framework, a multi-level
document structuring framework based on fine-tuned LLMs, combined with the
Auto-Merge retrieval algorithm to improve retrieval quality. Experiments
demonstrate that HiCBench effectively evaluates the impact of different
chunking methods across the entire RAG pipeline. Moreover, HiChunk achieves
better chunking quality within reasonable time consumption, thereby enhancing
the overall performance of RAG systems.
Authors' comments: 17 pages, 5 figures, 6 tables
Mustapha Adamu, Qi Zhang, Huitong Pan, Longin Jan Latecki, Eduard C. Dragut
The growing complexity and volume of climate science literature make it
increasingly difficult for researchers to find relevant information across
models, datasets, regions, and variables. This paper introduces a
domain-specific Knowledge Graph (KG) built from climate publications and
broader scientific texts, aimed at improving how climate knowledge is accessed
and used. Unlike keyword based search, our KG supports structured, semantic
queries that help researchers discover precise connections such as which models
have been validated in specific regions or which datasets are commonly used
with certain teleconnection patterns. We demonstrate how the KG answers such
questions using Cypher queries, and outline its integration with large language
models in RAG systems to improve transparency and reliability in
climate-related question answering. This work moves beyond KG construction to
show its real world value for climate researchers, model developers, and others
who rely on accurate, contextual scientific information.
Authors' comments: ACM SIGIR 2025 Workshop MANILA
Rodrigo Braz Teixeira, Izaak Neri, Pablo Sartori
Liquid mixtures can separate into phases with distinct composition. This phenomenon has recently come back to prominence due to its role in complex biological liquids, such as the cytoplasm, which contain thousands of components. For simple two-component mixtures phase-separated states are global free energy minima. However, local free energy minima, i.e. metastable states, are known to play a dominant role in complex systems with many components. For example, Hopfield neural networks can retrieve information from partial cues via relaxation to metastable states. Under what conditions can phase separated states be metastable, and what are the implications for information processing in multicomponent liquids? In this work we develop the general thermodynamic formalism of metastable phase separation. We then apply this formalism to an illustrative toy example inspired by recent experiments, binary mixtures with high-order interactions. Finally, as core application of the formalism, we study metastability in Hopfield liquids, a class of multicomponent mixtures capable of storing information on the composition of phases. We show that these phases can be retrieved from partial cues via metastable phase separation. Spatial simulations of liquids with a large number of components match our analytical solution. Our work suggests that complex biological mixtures can perform information retrieval through metastable phase separation.
Authors' comments: 26 pages, 8 figures, 16 pages of supplement