Tianyuan Li, Lei Wang, Ahtamjan Ahmat, Yating Yang, Bo Ma, Rui Dong, Bangju Han
Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typically rely on manually crafted string IDs, clustering-based labels, or atomic identifiers requiring vocabulary expansion, all of which face challenges in semantic alignment or scalability.To address these limitations, we propose a vocabulary-efficient identifier generation framework that prompts MLLMs to generate Structured Semantic Identifiers from image-caption pairs. These identifiers are composed of concept-level tokens such as objects and actions, naturally aligning with the model's generation space without modifying the tokenizer. Additionally, we introduce a Rationale-Guided Supervision Strategy, prompting the model to produce a one-sentence explanation alongside each identifier serves as an auxiliary supervision signal that improves semantic grounding and reduces hallucinations during training.
Dayu Yang, Hui Fang
Connecting conversation with external domain knowledge is vital for conversational recommender systems (CRS) to correctly understand user preferences. However, existing solutions either require domain-specific engineering, which limits flexibility, or rely solely on large language models, which increases the risk of hallucination. While Retrieval-Augmented Generation (RAG) holds promise, its naive use in CRS is hindered by noisy dialogues that weaken retrieval and by overlooked nuances among similar items. We propose ReGeS, a reciprocal Retrieval-Generation Synergy framework that unifies generation-augmented retrieval to distill informative user intent from conversations and retrieval-augmented generation to differentiate subtle item features. This synergy obviates the need for extra annotations, reduces hallucinations, and simplifies continuous updates. Experiments on multiple CRS benchmarks show that ReGeS achieves state-of-the-art performance in recommendation accuracy, demonstrating the effectiveness of reciprocal synergy for knowledge-intensive CRS tasks.
Authors' comments: Accepted by WISE 2025: 26th International Web Information Systems Engineering conference. Our code is publicly available at the link: https://github.com/dayuyang1999/ReGeS
Jiale Deng, Yanyan Shen, Ziyuan Pei, Youmin Chen, Linpeng Huang
Retrieval-Augmented Generation (RAG) addresses large language model (LLM) hallucinations by grounding responses in external knowledge, but its effectiveness is compromised by poor-quality retrieved contexts containing irrelevant or noisy information. While existing approaches attempt to improve performance through context selection based on predefined context quality assessment metrics, they show limited gains over standard RAG. We attribute this limitation to their failure in holistically utilizing available information (query, context list, and generator) for comprehensive quality assessment. Inspired by recent advances in data selection, we reconceptualize context quality assessment as an inference-time data valuation problem and introduce the Contextual Influence Value (CI value). This novel metric quantifies context quality by measuring the performance degradation when removing each context from the list, effectively integrating query-aware relevance, list-aware uniqueness, and generator-aware alignment. Moreover, CI value eliminates complex selection hyperparameter tuning by simply retaining contexts with positive CI values. To address practical challenges of label dependency and computational overhead, we develop a parameterized surrogate model for CI value prediction during inference. The model employs a hierarchical architecture that captures both local query-context relevance and global inter-context interactions, trained through oracle CI value supervision and end-to-end generator feedback. Extensive experiments across 8 NLP tasks and multiple LLMs demonstrate that our context selection method significantly outperforms state-of-the-art baselines, effectively filtering poor-quality contexts while preserving critical information. Code is available at https://github.com/SJTU-DMTai/RAG-CSM.
Shunyu Liu, Guangdong Bai, Mark Utting, Guowei Yang
Automated Program Repair (APR) has emerged as a promising paradigm for
reducing debugging time and improving the overall efficiency of software
development. Recent advances in Large Language Models (LLMs) have demonstrated
their potential for automated bug fixing and other software engineering tasks.
Nevertheless, the general-purpose nature of LLM pre-training means these models
often lack the capacity to perform project-specific repairs, which require
understanding of domain-specific identifiers, code structures, and contextual
relationships within a particular codebase. As a result, LLMs may struggle to
generate correct patches when the repair depends on project-specific
information.
To address this limitation, we introduce RelRepair, a novel approach that
retrieves relevant project-specific code to enhance automated program repair.
RelRepair first identifies relevant function signatures by analyzing function
names and code comments within the project. It then conducts deeper code
analysis to retrieve code snippets relevant to the repair context. The
retrieved relevant information is then incorporated into the LLM's input
prompt, guiding the model to generate more accurate and informed patches. We
evaluate RelRepair on two widely studied datasets, Defects4J V1.2 and
ManySStuBs4J, and compare its performance against several state-of-the-art
LLM-based APR approaches. RelRepair successfully repairs 101 bugs in Defects4J
V1.2. Furthermore, RelRepair achieves a 17.1\% improvement in the ManySStuBs4J
dataset, increasing the overall fix rate to 48.3\%. These results highlight the
importance of providing relevant project-specific information to LLMs, shedding
light on effective strategies for leveraging LLMs in APR tasks.
Authors' comments: 11 pages, 5 figures, under review at TSE
Hyun Jun Kim, Hyeong Yong Choi, Changwon Lim
This report presents the AISTAT team's submission to the language-based audio
retrieval task in DCASE 2025 Task 6. Our proposed system employs dual encoder
architecture, where audio and text modalities are encoded separately, and their
representations are aligned using contrastive learning. Drawing inspiration
from methodologies of the previous year's challenge, we implemented a
distillation approach and leveraged large language models (LLMs) for effective
data augmentation techniques, including back-translation and LLM mix.
Additionally, we incorporated clustering to introduce an auxiliary
classification task for further finetuning. Our best single system achieved a
mAP@16 of 46.62, while an ensemble of four systems reached a mAP@16 of 48.83 on
the Clotho development test split.
Authors' comments: 5 pages, 1 figure, DCASE2025 Task2 technical report
Xulin Li, Yan Lu, Bin Liu, Jiaze Li, Qinhong Yang, Tao Gong, Qi Chu, Mang Ye et al.
In real applications, person re-identification (ReID) is expected to retrieve
the target person at any time, including both daytime and nighttime, ranging
from short-term to long-term. However, existing ReID tasks and datasets can not
meet this requirement, as they are constrained by available time and only
provide training and evaluation for specific scenarios. Therefore, we
investigate a new task called Anytime Person Re-identification (AT-ReID), which
aims to achieve effective retrieval in multiple scenarios based on variations
in time. To address the AT-ReID problem, we collect the first large-scale
dataset, AT-USTC, which contains 403k images of individuals wearing multiple
clothes captured by RGB and IR cameras. Our data collection spans 21 months,
and 270 volunteers were photographed on average 29.1 times across different
dates or scenes, 4-15 times more than current datasets, providing conditions
for follow-up investigations in AT-ReID. Further, to tackle the new challenge
of multi-scenario retrieval, we propose a unified model named Uni-AT, which
comprises a multi-scenario ReID (MS-ReID) framework for scenario-specific
features learning, a Mixture-of-Attribute-Experts (MoAE) module to alleviate
inter-scenario interference, and a Hierarchical Dynamic Weighting (HDW)
strategy to ensure balanced training across all scenarios. Extensive
experiments show that our model leads to satisfactory results and exhibits
excellent generalization to all scenarios.
Authors' comments: Accepted by IJCAI 2025
Jialin Chen, Houyu Zhang, Seongjun Yun, Alejandro Mottini, Rex Ying, Xiang Song, Vassilis N. Ioannidis, Zheng Li et al.
Retrieval-Augmented Generation (RAG) has significantly mitigated the hallucinations of Large Language Models (LLMs) by grounding the generation with external knowledge. Recent extensions of RAG to graph-based retrieval offer a promising direction, leveraging the structural knowledge for multi-hop reasoning. However, existing graph RAG typically decouples retrieval and reasoning processes, which prevents the retriever from adapting to the reasoning needs of the LLM. They also struggle with scalability when performing multi-hop expansion over large-scale graphs, or depend heavily on annotated ground-truth entities, which are often unavailable in open-domain settings. To address these challenges, we propose a novel graph retriever trained end-to-end with LLM, which features an attention-based growing and pruning mechanism, adaptively navigating multi-hop relevant entities while filtering out noise. Within the extracted subgraph, structural knowledge and semantic features are encoded via soft tokens and the verbalized graph, respectively, which are infused into the LLM together, thereby enhancing its reasoning capability and facilitating interactive joint training of the graph retriever and the LLM reasoner. Experimental results across three QA benchmarks show that our approach consistently achieves state-of-the-art performance, validating the strength of joint graph-LLM optimization for complex reasoning tasks. Notably, our framework eliminates the need for predefined ground-truth entities by directly optimizing the retriever using LLM logits as implicit feedback, making it especially effective in open-domain settings.
Jongyeop Hyun, Bumsoo Kim
Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback -- Feed-Target, Feed-Check, and Feed-Path -- to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE's potential for enhancing multimodal reasoning.
Authors' comments: Accepted at EMNLP 2025 main conference
Xiaosheng Long, Hanyu Wang, Zhentao Song, Kun Luo, Hongde Liu
Recent retrieval-augmented image captioning methods incorporate external knowledge to compensate for the limitations in comprehending complex scenes. However, current approaches face challenges in relation modeling: (1) the representation of semantic prompts is too coarse-grained to capture fine-grained relationships; (2) these methods lack explicit modeling of image objects and their semantic relationships. To address these limitations, we propose RACap, a relation-aware retrieval-augmented model for image captioning, which not only mines structured relation semantics from retrieval captions, but also identifies heterogeneous objects from the image. RACap effectively retrieves structured relation features that contain heterogeneous visual information to enhance the semantic consistency and relational expressiveness. Experimental results show that RACap, with only 10.8M trainable parameters, achieves superior performance compared to previous lightweight captioning models.
Liwei Liao, Xufeng Li, Xiaoyun Zheng, Boning Liu, Feng Gao, Ronggang Wang
3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods encounter two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require larges amounts of labeled data for effective training. To this end, we propose \underline{G}rounding via \underline{V}iew \underline{R}etrieval (GVR), a novel zero-shot visual grounding framework for 3DGS to transform 3DVG as a 2D retrieval task that leverages object-level view retrieval to collect grounding clues from multiple views, which not only avoids the costly process of 3D annotation, but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found in https://github.com/leviome/GVR_demos.
Yufeng Li, Arkaitz Zubiaga
The rapid spread of misinformation on social media underscores the need for
scalable fact-checking tools. A key step is claim detection, which identifies
statements that can be objectively verified. Prior approaches often rely on
linguistic cues or claim check-worthiness, but these struggle with vague
political discourse and diverse formats such as tweets. We present RAVE
(Retrieval and Scoring Aware Verifiable Claim Detection), a framework that
combines evidence retrieval with structured signals of relevance and source
credibility. Experiments on CT22-test and PoliClaim-test show that RAVE
consistently outperforms text-only and retrieval-based baselines in both
accuracy and F1.
Authors' comments: 5 pages, 1 figure
Chong You, Rajesh Jayaram, Ananda Theertha Suresh, Robin Nittka, Felix Yu, Sanjiv Kumar
Dual encoder (DE) models, where a pair of matching query and document are embedded into similar vector representations, are widely used in information retrieval due to their simplicity and scalability. However, the Euclidean geometry of the embedding space limits the expressive power of DEs, which may compromise their quality. This paper investigates such limitations in the context of hierarchical retrieval (HR), where the document set has a hierarchical structure and the matching documents for a query are all of its ancestors. We first prove that DEs are feasible for HR as long as the embedding dimension is linear in the depth of the hierarchy and logarithmic in the number of documents. Then we study the problem of learning such embeddings in a standard retrieval setup where DEs are trained on samples of matching query and document pairs. Our experiments reveal a lost-in-the-long-distance phenomenon, where retrieval accuracy degrades for documents further away in the hierarchy. To address this, we introduce a pretrain-finetune recipe that significantly improves long-distance retrieval without sacrificing performance on closer documents. We experiment on a realistic hierarchy from WordNet for retrieving documents at various levels of abstraction, and show that pretrain-finetune boosts the recall on long-distance pairs from 19% to 76%. Finally, we demonstrate that our method improves retrieval of relevant products on a shopping queries dataset.
Authors' comments: NeurIPS 2025
Ruohan Zhang, Jiacheng Li, Julian McAuley, Yupeng Hou
Semantic identifiers (IDs) have proven effective in adapting large language models for generative recommendation and retrieval. However, existing methods often suffer from semantic ID conflicts, where semantically similar documents (or items) are assigned identical IDs. A common strategy to avoid conflicts is to append a non-semantic token to distinguish them, which introduces randomness and expands the search space, therefore hurting performance. In this paper, we propose purely semantic indexing to generate unique, semantic-preserving IDs without appending non-semantic tokens. We enable unique ID assignment by relaxing the strict nearest-centroid selection and introduce two model-agnostic algorithms: exhaustive candidate matching (ECM) and recursive residual searching (RRS). Extensive experiments on sequential recommendation, product search, and document retrieval tasks demonstrate that our methods improve both overall and cold-start performance, highlighting the effectiveness of ensuring ID uniqueness.
Saket S. Chaturvedi, Gaurav Bagwe, Lan Zhang, Xiaoyong Yuan
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by
retrieving relevant documents from external sources to improve factual accuracy
and verifiability. However, this reliance introduces new attack surfaces within
the retrieval pipeline, beyond the LLM itself. While prior RAG attacks have
exposed such vulnerabilities, they largely rely on manipulating user queries,
which is often infeasible in practice due to fixed or protected user inputs.
This narrow focus overlooks a more realistic and stealthy vector: instructional
prompts, which are widely reused, publicly shared, and rarely audited. Their
implicit trust makes them a compelling target for adversaries to manipulate RAG
behavior covertly.
We introduce a novel attack for Adversarial Instructional Prompt (AIP) that
exploits adversarial instructional prompts to manipulate RAG outputs by subtly
altering retrieval behavior. By shifting the attack surface to the
instructional prompts, AIP reveals how trusted yet seemingly benign interface
components can be weaponized to degrade system integrity. The attack is crafted
to achieve three goals: (1) naturalness, to evade user detection; (2) utility,
to encourage use of prompts; and (3) robustness, to remain effective across
diverse query variations. We propose a diverse query generation strategy that
simulates realistic linguistic variation in user queries, enabling the
discovery of prompts that generalize across paraphrases and rephrasings.
Building on this, a genetic algorithm-based joint optimization is developed to
evolve adversarial prompts by balancing attack success, clean-task utility, and
stealthiness. Experimental results show that AIP achieves up to 95.23% ASR
while preserving benign functionality. These findings uncover a critical and
previously overlooked vulnerability in RAG systems, emphasizing the need to
reassess the shared instructional prompts.
Authors' comments: Accepted at EMNLP 2025 Conference
Arda Kabadayi, Senem Velipasalar, Jiajing Chen
Compared to traditional image retrieval tasks, product retrieval in retail settings is even more challenging. Products of the same type from different brands may have highly similar visual appearances, and the query image may be taken from an angle that differs significantly from view angles of the stored catalog images. Foundational models, such as CLIP and SigLIP, often struggle to distinguish these subtle but important local differences. Pixel-wise matching methods, on the other hand, are computationally expensive and incur prohibitively high matching times. In this paper, we propose a new, hybrid method, called PRISM, for product retrieval in retail settings by leveraging the advantages of both vision-language model-based and pixel-wise matching approaches. To provide both efficiency/speed and finegrained retrieval accuracy, PRISM consists of three stages: 1) A vision-language model (SigLIP) is employed first to retrieve the top 35 most semantically similar products from a fixed gallery, thereby narrowing the search space significantly; 2) a segmentation model (YOLO-E) is applied to eliminate background clutter; 3) fine-grained pixel-level matching is performed using LightGlue across the filtered candidates. This framework enables more accurate discrimination between products with high inter-class similarity by focusing on subtle visual cues often missed by global models. Experiments performed on the ABV dataset show that our proposed PRISM outperforms the state-of-the-art image retrieval methods by 4.21% in top-1 accuracy while still remaining within the bounds of real-time processing for practical retail deployments.
Longfei Zuo, Pingjun Hong, Oliver Kraus, Barbara Plank, Robert Litschko
Multi-stage information retrieval (IR) has become a widely-adopted paradigm
in search. While Large Language Models (LLMs) have been extensively evaluated
as second-stage reranking models for monolingual IR, a systematic large-scale
comparison is still lacking for cross-lingual IR (CLIR). Moreover, while prior
work shows that LLM-based rerankers improve CLIR performance, their evaluation
setup relies on lexical retrieval with machine translation (MT) for the first
stage. This is not only prohibitively expensive but also prone to error
propagation across stages. Our evaluation on passage-level and document-level
CLIR reveals that further gains can be achieved with multilingual bi-encoders
as first-stage retrievers and that the benefits of translation diminishes with
stronger reranking models. We further show that pairwise rerankers based on
instruction-tuned LLMs perform competitively with listwise rerankers. To the
best of our knowledge, we are the first to study the interaction between
retrievers and rerankers in two-stage CLIR with LLMs. Our findings reveal that,
without MT, current state-of-the-art rerankers fall severely short when
directly applied in CLIR.
Authors' comments: Accepted at EMNLP 2025 (Findings)
Shangrong Wu, Yanghong Zhou, Yang Chen, Feng Zhang, P. Y. Mok
Image retrieval remains a fundamental yet challenging problem in computer vision. While recent advances in Multimodal Large Language Models (MLLMs) have demonstrated strong reasoning capabilities, existing methods typically employ them only for evaluation, without involving them directly in the ranking process. As a result, their rich multimodal reasoning abilities remain underutilized, leading to suboptimal performance. In this paper, we propose a novel Chain-of-Thought Re-Ranking (CoTRR) method to address this issue. Specifically, we design a listwise ranking prompt that enables MLLM to directly participate in re-ranking candidate images. This ranking process is grounded in an image evaluation prompt, which assesses how well each candidate aligns with users query. By allowing MLLM to perform listwise reasoning, our method supports global comparison, consistent reasoning, and interpretable decision-making - all of which are essential for accurate image retrieval. To enable structured and fine-grained analysis, we further introduce a query deconstruction prompt, which breaks down the original query into multiple semantic components. Extensive experiments on five datasets demonstrate the effectiveness of our CoTRR method, which achieves state-of-the-art performance across three image retrieval tasks, including text-to-image retrieval (TIR), composed image retrieval (CIR) and chat-based image retrieval (Chat-IR). Our code is available at https://github.com/freshfish15/CoTRR .
Suyuchen Wang, Jinlin Wang, Xinyu Wang, Shiqi Li, Xiangru Tang, Sirui Hong, Xiao-Wen Chang, Chenglin Wu et al.
Large language models (LLMs) often struggle with context fidelity, producing
inconsistent answers when responding to questions based on provided
information. Existing approaches either rely on expensive supervised
fine-tuning to generate evidence post-answer or train models to perform web
searches without necessarily improving utilization of the given context. We
propose CARE, a novel native retrieval-augmented reasoning framework that
teaches LLMs to explicitly integrate in-context evidence within their reasoning
process with the model's own retrieval capabilities. Our method requires
limited labeled evidence data while significantly enhancing both retrieval
accuracy and answer generation performance through strategically retrieved
in-context tokens in the reasoning chain. Extensive experiments on multiple
real-world and counterfactual QA benchmarks demonstrate that our approach
substantially outperforms supervised fine-tuning, traditional
retrieval-augmented generation methods, and external retrieval solutions. This
work represents a fundamental advancement in making LLMs more accurate,
reliable, and efficient for knowledge-intensive tasks.
Authors' comments: Accepted as a main conference paper at EMNLP 2025
Yifan Liu, Qianfeng Wen, Mark Zhao, Jiazhou Liang, Scott Sanner
Dense Passage Retrieval (DPR) typically relies on Euclidean or cosine
distance to measure query-passage relevance in embedding space, which is
effective when embeddings lie on a linear manifold. However, our experiments
across DPR benchmarks suggest that embeddings often lie on lower-dimensional,
non-linear manifolds, especially in out-of-distribution (OOD) settings, where
cosine and Euclidean distance fail to capture semantic similarity. To address
this limitation, we propose a manifold-aware distance metric for DPR (MA-DPR)
that models the intrinsic manifold structure of passages using a nearest
neighbor graph and measures query-passage distance based on their shortest path
in this graph. We show that MA-DPR outperforms Euclidean and cosine distances
by up to 26% on OOD passage retrieval with comparable in-distribution
performance across various embedding models while incurring a minimal increase
in query inference time. Empirical evidence suggests that manifold-aware
distance allows DPR to leverage context from related neighboring passages,
making it effective even in the absence of direct semantic overlap. MADPR can
be applied to a wide range of dense embedding and retrieval tasks, offering
potential benefits across a wide spectrum of domains.
Authors' comments: 19 pages, 8 figures
Gwendal Le Vaillant, Yannick Molle
Efficiently retrieving specific instrument timbres from audio mixtures remains a challenge in digital music production. This paper introduces a contrastive learning framework for musical instrument retrieval, enabling direct querying of instrument databases using a single model for both single- and multi-instrument sounds. We propose techniques to generate realistic positive/negative pairs of sounds for virtual musical instruments, such as samplers and synthesizers, addressing limitations in common audio data augmentation methods. The first experiment focuses on instrument retrieval from a dataset of 3,884 instruments, using single-instrument audio as input. Contrastive approaches are competitive with previous works based on classification pre-training. The second experiment considers multi-instrument retrieval with a mixture of instruments as audio input. In this case, the proposed contrastive framework outperforms related works, achieving 81.7\% top-1 and 95.7\% top-5 accuracies for three-instrument mixtures.