Eleni Kamateri, Renukswamy Chikkamath, Michail Salampasis, Linda Andersson, Markus Endres
Effective query formulation is a key challenge in long-document Information
Retrieval (IR). This challenge is particularly acute in domain-specific
contexts like patent retrieval, where documents are lengthy, linguistically
complex, and encompass multiple interrelated technical topics. In this work, we
present the application of recent extractive and abstractive summarization
methods for generating concise, purpose-specific summaries of patent documents.
We further assess the utility of these automatically generated summaries as
surrogate queries across three benchmark patent datasets and compare their
retrieval performance against conventional approaches that use entire patent
sections. Experimental results show that summarization-based queries
significantly improve prior-art retrieval effectiveness, highlighting their
potential as an efficient alternative to traditional query formulation
techniques.
Authors' comments: This version was submitted and accepted for publication at the 6th
Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech
2025), held in conjunction with SIGIR 2025. A revised and polished version,
incorporating reviewers' feedback, will follow.
Pranav Kasela, Gabriella Pasi, Raffaele Perego
Academic search is a search task aimed at managing and retrieving scientific
documents such as journal articles and conference papers. Personalization in this
context meets individual researchers' needs by leveraging, through user
profiles, user-related information (e.g., documents authored by a
researcher) to improve search effectiveness and reduce information
overload. While citation graphs are a valuable means to support the outcome of
recommender systems, their use in personalized academic search (with, e.g.,
papers as nodes and citations as edges) is still under-explored.
Existing personalized models for academic search often struggle to fully
capture users' academic interests. To address this, we propose a two-step
approach: first, training a neural language model for retrieval, then
converting the academic graph into a knowledge graph and embedding it into a
shared semantic space with the language model using translational embedding
techniques. This allows user models to capture both explicit relationships and
hidden structures in citation graphs and paper content. We evaluate our
approach in four academic search domains, outperforming traditional graph-based
and personalized models in three out of four, with up to a 10% improvement in
MAP@100 over the second-best model. This highlights the potential of knowledge
graph-based user models to enhance retrieval effectiveness.
Authors' comments: Accepted in Information Systems. [17 May 2025]
https://doi.org/10.1016/j.is.2025.102574
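For readers unfamiliar with the MAP@100 metric reported in the abstract above, the following is a minimal sketch of one common definition of mean average precision at a cutoff k. The function names, the normalization by min(|relevant|, k), and the assumption that each ranking contains no duplicate document IDs are illustrative choices, not details taken from the paper; other evaluation toolkits normalize differently.

```python
from typing import Dict, List, Set


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int = 100) -> float:
    """Average precision over the top-k results of a single query.

    Assumes `ranked` has no duplicate IDs. Normalizes by min(|relevant|, k),
    which is one common convention among several.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 100
) -> float:
    """Mean of per-query average precision; queries missing from `runs` score 0."""
    if not qrels:
        return 0.0
    total = sum(average_precision_at_k(runs.get(q, []), rel, k) for q, rel in qrels.items())
    return total / len(qrels)
```

With a ranking `["a", "b", "c", "d"]` and relevant set `{"a", "c"}`, the relevant hits occur at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.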
Isaac Shi, Zeyuan Li, Fan Liu, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi
We introduce the THOR (Transformer Heuristics for On-Demand Retrieval) Module, a secure, scalable engine designed and implemented by eSapiens that transforms natural-language questions into verified, read-only SQL analytics for enterprise databases. The Text-to-SQL module follows a decoupled orchestration/execution architecture: a Supervisor Agent routes queries, Schema Retrieval dynamically injects table and column metadata, and a SQL Generation Agent emits single-statement SELECT queries protected by a read-only guardrail. An integrated Self-Correction & Rating loop captures empty results, execution errors, or low-quality outputs and triggers up to five LLM-driven regeneration attempts. Finally, a Result Interpretation Agent produces concise, human-readable insights and hands raw rows to the Insight & Intelligence engine for visualization or forecasting. Smoke tests across finance, sales, and operations scenarios demonstrate reliable ad-hoc querying and automated periodic reporting. By embedding schema awareness, fault-tolerant execution, and compliance guardrails, the THOR Module empowers non-technical users to access live data with zero-SQL simplicity and enterprise-grade safety.
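The read-only guardrail and the self-correction loop described above can be illustrated with a small sketch. This is not the THOR implementation: the function names, the string-based guardrail check, the feedback messages, and the use of SQLite are all assumptions made for illustration, and `generate_sql` is a hypothetical stand-in for the LLM-driven SQL Generation Agent. Only the cap of five attempts comes from the abstract.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

MAX_ATTEMPTS = 5  # the abstract mentions up to five regeneration attempts


def is_read_only(sql: str) -> bool:
    """Crude illustrative guardrail: one statement, and it must start with SELECT."""
    stripped = sql.strip().rstrip(";").strip()
    return ";" not in stripped and stripped.upper().startswith("SELECT")


def run_with_self_correction(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: Callable[[str, Optional[str]], str],
) -> Tuple[Optional[List[tuple]], int]:
    """Return (rows, attempts_used); rows is None if every attempt failed."""
    feedback: Optional[str] = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        sql = generate_sql(question, feedback)
        if not is_read_only(sql):
            feedback = "rejected: only single SELECT statements are allowed"
            continue
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = f"execution error: {exc}"  # fed back to the generator
            continue
        if not rows:
            feedback = "query returned no rows"
            continue
        return rows, attempt
    return None, MAX_ATTEMPTS
```

A production guardrail would more plausibly rely on a SQL parser and database-level read-only permissions than on string checks; the sketch only shows how failures can be routed back into regeneration.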
Alexander Frummet, Emanuel Slany, Jonas Amling, Moritz Lang, Stephan Scheele
Conversational agents such as Microsoft Copilot and Google Gemini assist
users with complex search tasks but often generate misleading or fabricated
references. This undermines trust, particularly in high-stakes domains such as
medicine and finance. Explainable information retrieval (XIR) aims to address
this by making search results more transparent and interpretable. While most
XIR research is domain-agnostic, this paper focuses on auditing -- a critical
yet underexplored area. We argue that XIR systems can support auditors in
completing their complex tasks. We outline key challenges and future research
directions to advance XIR in this domain.
Authors' comments: Extended abstract accepted at the Workshop on Explainability in
Information Retrieval (WExIR), co-located with SIGIR 2025
Paul J. L. Ammann, Jonas Golde, Alan Akbik
Grounding large language models (LLMs) in verifiable external sources is a
well-established strategy for generating reliable answers. Retrieval-augmented
generation (RAG) is one such approach, particularly effective for tasks like
question answering: it retrieves passages that are semantically related to the
question and then conditions the model on this evidence. However, multi-hop
questions, such as "Which company among NVIDIA, Apple, and Google made the
biggest profit in 2023?," challenge RAG because relevant facts are often
distributed across multiple documents rather than co-occurring in one source,
making it difficult for standard RAG to retrieve sufficient information. To
address this, we propose a RAG pipeline that incorporates question
decomposition: (i) an LLM decomposes the original query into sub-questions,
(ii) passages are retrieved for each sub-question, and (iii) the merged
candidate pool is reranked to improve the coverage and precision of the
retrieved evidence. We show that question decomposition effectively assembles
complementary documents, while reranking reduces noise and promotes the most
relevant passages before answer generation. Although reranking itself is
standard, we show that pairing an off-the-shelf cross-encoder reranker with
LLM-driven question decomposition bridges the retrieval gap on multi-hop
questions and provides a practical, drop-in enhancement, without any extra
training or specialized indexing. We evaluate our approach on MultiHop-RAG
and HotpotQA, showing gains in retrieval (MRR@10: +36.7%) and answer accuracy
(F1: +11.6%) over standard RAG baselines.
Authors' comments: Accepted to ACL SRW 2025. 9 Pages, 2 Figures, 4 Tables
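The three steps listed in the abstract above (decompose, retrieve per sub-question, rerank the merged pool) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: a token-overlap score stands in for both the retriever and the cross-encoder reranker, the `decompose` callable is a hypothetical placeholder for the LLM, reranking against the original question is an assumed design choice, and all passages are made-up toy text.

```python
from typing import Callable, Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens with surrounding punctuation stripped."""
    return {t.strip(".,?!\"'()").lower() for t in text.split()} - {""}


def overlap_score(query: str, passage: str) -> float:
    """Fraction of query tokens found in the passage (toy relevance score)."""
    q = tokens(query)
    return len(q & tokens(passage)) / len(q) if q else 0.0


def retrieve(query: str, corpus: Dict[str, str], top_k: int = 2) -> List[str]:
    """Return the IDs of the top_k passages by toy relevance score."""
    ranked = sorted(corpus, key=lambda pid: overlap_score(query, corpus[pid]), reverse=True)
    return ranked[:top_k]


def decompose_retrieve_rerank(
    question: str,
    decompose: Callable[[str], List[str]],
    corpus: Dict[str, str],
    per_sub_k: int = 2,
    final_k: int = 3,
) -> List[str]:
    sub_questions = decompose(question)  # (i) LLM stand-in splits the question
    pool: List[str] = []
    for sub_question in sub_questions:  # (ii) retrieve for each sub-question
        for pid in retrieve(sub_question, corpus, per_sub_k):
            if pid not in pool:
                pool.append(pid)
    # (iii) rerank the merged pool; here against the original question (assumption)
    reranked = sorted(pool, key=lambda pid: overlap_score(question, corpus[pid]), reverse=True)
    return reranked[:final_k]
```

In a toy corpus where each company's figure lives in a separate passage, a single top-2 retrieval for the full question can cover at most two of the three companies, while the per-sub-question retrieval gives each company its own chance to enter the candidate pool.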
Hanzhong Liang, Jinghao Shi, Xiang Shen, Zixuan Wang, Vera Wen, Ardalan Mehrani, Zhiqian Chen, Yifan Wu et al.
Video understanding plays a fundamental role in content moderation on short video platforms, enabling the detection of inappropriate content. While classification remains the dominant approach for content moderation, it often struggles in scenarios requiring rapid and cost-efficient responses, such as trend adaptation and urgent escalations. To address this issue, we introduce an Embedding-Based Retrieval (EBR) method designed to complement traditional classification approaches. We first leverage a Supervised Contrastive Learning (SCL) framework to train a suite of foundation embedding models, including both single-modal and multi-modal architectures. Our models demonstrate superior performance over established contrastive learning methods such as CLIP and MoCo. Building on these embedding models, we design and implement an embedding-based retrieval system that integrates embedding generation and video retrieval to enable efficient and effective trend handling. Comprehensive offline experiments on 25 diverse emerging trends show that EBR improves ROC-AUC from 0.85 to 0.99 and PR-AUC from 0.35 to 0.95. Further online experiments reveal that EBR increases action rates by 10.32% and reduces operational costs by over 80%, while also enhancing interpretability and flexibility compared to classification-based solutions.
Authors' comments: Camera ready for SIGIR 2025
Xuan Zhang, Ziyan Jiang, Rui Meng, Yifei Leng, Zhenbang Xiao, Zora Zhiruo Wang, Yanyi Shang, Dehan Kong
Trajectory data, capturing human actions and environmental states across
various modalities, holds significant potential for enhancing AI agent
capabilities, particularly in GUI environments. However, modeling the
representation of trajectory-level data remains a significant challenge that
has not been systematically addressed amid explosive trajectory data growth. In
this work, we introduce Multimodal Trajectory Retrieval, bridging the gap
between universal retrieval and agent-centric trajectory modeling. We construct
the Unified Agent Trajectory Dataset (UATD) from annotated demonstrations and
states across diverse real-world scenarios. Based on this, we present
GAE-Bench, a benchmark containing a large number of trajectory-based retrieval
pairs. In addition, we propose GAE-Retriever, a multimodal retrieval framework
that adopts vision-language models and incorporates optimized contrastive
learning through token selection and the GradCache mechanism. Comprehensive
evaluations across multiple datasets show that GAE-Retriever consistently
outperforms strong baselines in retrieval recall, highlighting its
effectiveness in advancing multimodal trajectory retrieval.
Authors' comments: 18 pages, 3 figures, accepted by Workshop on Computer-use Agents @
ICML 2025
Zunran Wang, Zheng Shenpeng, Wang Shenglan, Minghui Zhao, Zhonghua Li
Hybrid-based retrieval methods, which unify dense-vector and lexicon-based retrieval, have garnered considerable attention in the industry due to their performance gains. However, despite their promising results, the application of these hybrid paradigms in Chinese retrieval contexts has remained largely underexplored. In this paper, we introduce HyReC, an innovative end-to-end optimization method tailored specifically for hybrid-based retrieval in Chinese. HyReC enhances performance by integrating the semantic union of terms into the representation model. Additionally, it features the Global-Local-Aware Encoder (GLAE) to promote consistent semantic sharing between lexicon-based and dense retrieval while minimizing the interference between them. To further refine alignment, we incorporate a Normalization Module (NM) that fosters mutual benefits between the retrieval approaches. Finally, we evaluate HyReC on the C-MTEB retrieval benchmark to demonstrate its effectiveness.
Masaki Watabe, Joe Sakamoto, Hideaki Yoshimura, Tomomi Nemoto, Kazunari Kaizu
Although the transport of intensity equation (TIE) can be used to reconstruct
the spatial phase variations produced by samples such as magnetic materials and
biological cells, the impact of complex refractive indices on quantitative
phase imaging remains unexplored. To overcome this difficulty, we provide
herein a more physically generalized TIE framework that enables the
reconstruction of spatial variations in both refractive-index fluctuations and
attenuation coefficients. We then demonstrate this method using bright-field
microscopy imaging. The results reveal robust performance in retrieving
heterogeneous optical structures within measurable parameter regions. Finally,
we analyze the symmetry of the attenuation reversal in the TIE framework, thus
revealing the invariant nature of the absorptive and scattering properties in
the samples of interest.
Authors' comments: 6 pages, 3 figures in main text; 14 pages, 9 figures in the
supplemental material
Hyunsun Hong, Jongmoon Baik
Automated code review comment generation (RCG) aims to assist developers by automatically producing natural language feedback for code changes. Existing approaches are primarily either generation-based, using pretrained language models, or information retrieval-based (IR), reusing comments from similar past examples. While generation-based methods leverage code-specific pretraining on large code-natural language corpora to learn semantic relationships between code and natural language, they often struggle to generate low-frequency but semantically important tokens due to their probabilistic nature. In contrast, IR-based methods excel at recovering such rare tokens by copying from existing examples but lack flexibility in adapting to new code contexts, for example when input code contains identifiers or structures not found in the retrieval database. To bridge the gap between generation-based and IR-based methods, this work proposes to leverage retrieval-augmented generation (RAG) for RCG by conditioning pretrained language models on retrieved code-review exemplars. By providing relevant examples that illustrate how similar code has been previously reviewed, the model is better guided to generate accurate review comments. Our evaluation on the Tufano et al. benchmark shows that RAG-based RCG outperforms both generation-based and IR-based RCG. It achieves up to +1.67% higher exact match and +4.25% higher BLEU scores compared to generation-based RCG. It also improves the generation of low-frequency ground-truth tokens by up to 24.01%. We additionally find that performance improves as the number of retrieved exemplars increases.
Xubo Qin, Jun Bai, Jiaqi Li, Zixia Jia, Zilong Zheng
Traditional information retrieval (IR) methods excel at textual and semantic matching but struggle in reasoning-intensive retrieval tasks that require multi-hop inference or complex semantic understanding between queries and documents. One promising solution is to explicitly rewrite or augment queries using large language models (LLMs) to elicit reasoning-relevant content prior to retrieval. However, the widespread use of large-scale language models like GPT-4 or LLaMA3-70B remains impractical due to their high inference cost and limited deployability in real-world systems. In this work, we introduce TongSearch QR (previously known as "TongSearch Reasoner"), a family of small-scale language models for query reasoning and rewriting in reasoning-intensive retrieval. With a novel semi-rule-based reward function, we employ reinforcement learning approaches enabling smaller language models, e.g., Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct, to achieve query reasoning performance rivaling large-scale language models without their prohibitive inference costs. Experimental results on the BRIGHT benchmark show that, with BM25 as the retriever, both TongSearch QR-7B and TongSearch QR-1.5B models significantly outperform existing baselines, including prompt-based query reasoners and some of the latest dense retrievers trained for reasoning-intensive retrieval tasks, offering superior adaptability for real-world deployment.
Amit Jaspal, Qian Dang, Ajantha Ramineni
Modern large-scale recommender systems employ a multi-stage ranking funnel (Retrieval, Pre-ranking, Ranking) to balance engagement and computational constraints (latency, CPU). However, the initial retrieval stage, often relying on efficient but less precise methods like K-Nearest Neighbors (KNN), struggles to effectively surface the most engaging items from billion-scale catalogs, particularly distinguishing highly relevant and engaging candidates from merely relevant ones. We introduce Recall Augmentation through Deferred Asynchronous Retrieval (RADAR), a novel framework that leverages asynchronous, offline computation to pre-rank a significantly larger candidate set for users using the full-complexity ranking model. These top-ranked items are stored and utilized as a high-quality retrieval source during online inference, bypassing online retrieval and pre-ranking stages for these candidates. We demonstrate through offline experiments that RADAR significantly boosts recall (2X Recall@200 vs DNN retrieval baseline) by effectively combining a larger retrieved candidate set with a more powerful ranking model. Online A/B tests confirm a +0.8% lift in topline engagement metrics, validating RADAR as a practical and effective method to improve recommendation quality under strict online serving constraints.
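The deferred, asynchronous idea described above can be sketched in a few lines. This is a simplified illustration and not the RADAR system: the function names, the in-memory dictionary used as the store, the candidate budget, and the merge policy (stored items first, then deduplicated online candidates) are all assumptions, and `full_ranker`, `online_retrieval`, and `pre_rank` are hypothetical callables.

```python
from typing import Callable, Dict, Iterable, List


def offline_prerank(
    user_ids: Iterable[str],
    candidates_for: Callable[[str], List[str]],
    full_ranker: Callable[[str, str], float],
    keep: int = 200,
) -> Dict[str, List[str]]:
    """Offline job: score a large candidate set with the full ranker, keep the top items."""
    store: Dict[str, List[str]] = {}
    for user in user_ids:
        scored = sorted(
            candidates_for(user), key=lambda item: full_ranker(user, item), reverse=True
        )
        store[user] = scored[:keep]
    return store


def online_candidates(
    user: str,
    store: Dict[str, List[str]],
    online_retrieval: Callable[[str], List[str]],
    pre_rank: Callable[[str, List[str]], List[str]],
    budget: int,
) -> List[str]:
    """Online path: stored items skip retrieval and pre-ranking; the rest fill the budget."""
    deferred = store.get(user, [])[:budget]
    remaining = budget - len(deferred)
    seen = set(deferred)
    fresh = [item for item in pre_rank(user, online_retrieval(user)) if item not in seen]
    return deferred + fresh[:remaining]
```

The design choice illustrated here is that the expensive ranker runs outside the latency-critical path, so the online stage only pays for a lookup plus the usual retrieval for any remaining slots. Users without a stored list simply fall back to the online path.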
Weihang Su, Qingyao Ai, Jingtao Zhan, Qian Dong, Yiqun Liu
Retrieval-Augmented Generation (RAG) has become a foundational paradigm for equipping large language models (LLMs) with external knowledge, playing a critical role in information retrieval and knowledge-intensive applications. However, conventional RAG systems typically adopt a static retrieve-then-generate pipeline and rely on in-context knowledge injection, which can be suboptimal for complex tasks that require multihop reasoning, adaptive information access, and deeper integration of external knowledge. Motivated by these limitations, the research community has moved beyond static retrieval and in-context knowledge injection. Among the emerging directions, this tutorial delves into two rapidly growing and complementary research areas on RAG: Dynamic RAG and Parametric RAG. Dynamic RAG adaptively determines when and what to retrieve during the LLM's generation process, enabling real-time adaptation to the LLM's evolving information needs. Parametric RAG rethinks how retrieved knowledge should be injected into LLMs, transitioning from input-level to parameter-level knowledge injection for enhanced efficiency and effectiveness. This tutorial offers a comprehensive overview of recent advances in these emerging research areas. It also shares theoretical foundations and practical insights to support and inspire further research in RAG.
Jifei Luo, Wenzheng Wu, Hantao Yao, Lu Yu, Changsheng Xu
Diffusion-based re-ranking methods are effective in modeling the data
manifolds through similarity propagation in affinity graphs. However, positive
signals tend to diminish over several steps away from the source, reducing
discriminative power beyond local regions. To address this issue, we introduce
the Locality Preserving Markovian Transition (LPMT) framework, which employs a
long-term thermodynamic transition process with multiple states for accurate
manifold distance measurement. The proposed LPMT first integrates diffusion
processes across separate graphs using Bidirectional Collaborative Diffusion
(BCD) to establish strong similarity relationships. Afterwards, Locality State
Embedding (LSE) encodes each instance into a distribution for enhanced local
consistency. These distributions are interconnected via the Thermodynamic
Markovian Transition (TMT) process, enabling efficient global retrieval while
maintaining local effectiveness. Experimental results across diverse tasks
confirm the effectiveness of LPMT for instance retrieval.
Authors' comments: This paper has been accepted by ICML 2025
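The observation in the abstract above that positive signals diminish several steps away from the source can be seen in a generic similarity-propagation sketch. This is a plain random-walk-style diffusion on a toy affinity graph, not the LPMT method: the update rule f <- alpha * S f + (1 - alpha) * y, the row normalization, the value of alpha, and the chain-shaped graph are all illustrative assumptions.

```python
from typing import List


def row_normalize(affinity: List[List[float]]) -> List[List[float]]:
    """Turn an affinity matrix into a row-stochastic transition matrix."""
    normalized = []
    for row in affinity:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def diffuse(
    affinity: List[List[float]], source: int, alpha: float = 0.8, steps: int = 100
) -> List[float]:
    """Iterate f <- alpha * S f + (1 - alpha) * y, with y a one-hot vector at the source."""
    n = len(affinity)
    transition = row_normalize(affinity)
    y = [1.0 if i == source else 0.0 for i in range(n)]
    f = y[:]
    for _ in range(steps):
        f = [
            alpha * sum(transition[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
            for i in range(n)
        ]
    return f


def chain_graph(n: int) -> List[List[float]]:
    """Affinity matrix of a path graph: node i is linked only to i - 1 and i + 1."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
```

Calling `diffuse(chain_graph(5), source=0)` yields scores that shrink with each hop away from node 0, which is the loss of discriminative power beyond local regions that the abstract refers to.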
Qiang Fu, Zonglei Jing, Zonghao Ying, Xiaoqian Li
The rapid progress of generative AI has enabled remarkable creative capabilities, yet it also raises urgent concerns regarding the safety of AI-generated visual content in real-world applications such as content moderation, platform governance, and digital media regulation. This includes unsafe material such as sexually explicit images, violent scenes, hate symbols, propaganda, and unauthorized imitations of copyrighted artworks. Existing image safety systems often rely on rigid category filters and produce binary outputs, lacking the capacity to interpret context or reason about nuanced, adversarially induced forms of harm. In addition, standard evaluation metrics (e.g., attack success rate) fail to capture the semantic severity and dynamic progression of toxicity. To address these limitations, we propose Perception-Retrieval-Judgement (PRJ), a cognitively inspired framework that models toxicity detection as a structured reasoning process. PRJ follows a three-stage design: it first transforms an image into descriptive language (perception), then retrieves external knowledge related to harm categories and traits (retrieval), and finally evaluates toxicity based on legal or normative rules (judgement). This language-centric structure enables the system to detect both explicit and implicit harms with improved interpretability and categorical granularity. In addition, we introduce a dynamic scoring mechanism based on a contextual toxicity risk matrix to quantify harmfulness across different semantic dimensions. Experiments show that PRJ surpasses existing safety checkers in detection accuracy and robustness while uniquely supporting structured category-level toxicity interpretation.
Wenzheng Zhang, Xi Victoria Lin, Karl Stratos, Wen-tau Yih, Mingda Chen
Retrieval-Augmented Generation (RAG) systems traditionally treat retrieval and generation as separate processes, requiring explicit textual queries to connect them. This separation can limit the ability of models to generalize across diverse tasks. In this work, we propose a query-free RAG system, named ImpRAG, which integrates retrieval and generation into a unified model. ImpRAG allows models to implicitly express their information needs, eliminating the need for human-specified queries. By dividing pretrained decoder-only language models into specialized layer groups, ImpRAG optimizes retrieval and generation tasks simultaneously. Our approach employs a two-stage inference process, using the same model parameters and forward pass for both retrieval and generation, thereby minimizing the disparity between retrievers and language models. Experiments on 8 knowledge-intensive tasks demonstrate that ImpRAG achieves 3.6-11.5 improvements in exact match scores on unseen tasks with diverse formats, highlighting its effectiveness in enabling models to articulate their own information needs and generalize across tasks. Our analysis underscores the importance of balancing retrieval and generation parameters and leveraging generation perplexities as retrieval training objectives for enhanced performance.
Xiaochen Wang, Zongyu Wu, Yuan Zhong, Xiang Zhang, Suhang Wang, Fenglong Ma
Graph retrieval-augmented generation (GRAG) places high demands on
graph-specific retrievers. However, existing retrievers often rely on language
models pretrained on plain text, limiting their effectiveness due to domain
misalignment and structure ignorance. To address these challenges, we propose
GPR, a graph-based retriever pretrained directly on knowledge graphs. GPR
aligns natural language questions with relevant subgraphs through LLM-guided
graph augmentation and employs a structure-aware objective to learn
fine-grained retrieval strategies. Experiments on two datasets, three LLM
backbones, and five baselines show that GPR consistently improves both
retrieval quality and downstream generation, demonstrating its effectiveness as
a robust retrieval solution for GRAG.
Authors' comments: Short paper submitted to EMNLP'25
Wenhao Ding, Sushant Veer, Yuxiao Chen, Yulong Cao, Chaowei Xiao, Marco Pavone
Learning-based planners generate natural human-like driving behaviors by learning to reason about nuanced interactions from data, overcoming the rigid behaviors that arise from rule-based planners. Nonetheless, data-driven approaches often struggle with rare, safety-critical scenarios and offer limited controllability over the generated trajectories. To address these challenges, we propose RealDrive, a Retrieval-Augmented Generation (RAG) framework that initializes a diffusion-based planning policy by retrieving the most relevant expert demonstrations from the training dataset. By interpolating between current observations and retrieved examples through a denoising process, our approach enables fine-grained control and safe behavior across diverse scenarios, leveraging the strong prior provided by the retrieved scenario. Another key insight is that a task-relevant retrieval model trained with planning-based objectives results in superior planning performance in our framework compared to a task-agnostic retriever. Experimental results demonstrate improved generalization to long-tail events and enhanced trajectory diversity compared to standard learning-based planners -- we observe a 40% reduction in collision rate on the Waymo Open Motion dataset with RAG.
Fabio Fehr, Prabhu Teja Sivaprasad, Luca Franceschi, Giovanni Zappella
In this paper, we introduce CoRet, a dense retrieval model designed for
code-editing tasks that integrates code semantics, repository structure, and
call graph dependencies. The model focuses on retrieving relevant portions of a
code repository based on natural language queries such as requests to implement
new features or fix bugs. These retrieved code chunks can then be presented to
a user or to a second code-editing model or agent. To train CoRet, we propose a
loss function explicitly designed for repository-level retrieval. On SWE-bench
and Long Code Arena's bug localisation datasets, we show that our model
substantially improves retrieval recall by at least 15 percentage points over
existing models, and ablate the design choices to show their importance in
achieving these results.
Authors' comments: ACL 2025
Chunxu Liu, Chi Xie, Xiaxu Chen, Wei Li, Feng Zhu, Rui Zhao, Limin Wang
Text-to-Image Retrieval (T2IR) is a highly valuable task that aims to match a
given textual query to images in a gallery. Existing benchmarks primarily focus
on textual queries describing overall image semantics or foreground salient
objects, possibly overlooking inconspicuous small objects, especially in
complex environments. Such small object retrieval is crucial, as in real-world
applications, the targets of interest are not always prominent in the image.
Thus, we introduce SORCE (Small Object Retrieval in Complex Environments), a
new subfield of T2IR, focusing on retrieving small objects in complex images
with textual queries. We propose a new benchmark, SORCE-1K, consisting of
images with complex environments and textual queries describing less
conspicuous small objects with minimal contextual cues from other salient
objects. Preliminary analysis on SORCE-1K finds that existing T2IR methods
struggle to capture small objects and encode all the semantics into a single
embedding, leading to poor retrieval performance on SORCE-1K. Therefore, we
propose to represent each image with multiple distinctive embeddings. We
leverage Multimodal Large Language Models (MLLMs) to extract multiple
embeddings for each image instructed by a set of Regional Prompts (ReP).
Experimental results show that our multi-embedding approach through MLLM and
ReP significantly outperforms existing T2IR methods on SORCE-1K. Our
experiments validate the effectiveness of SORCE-1K for benchmarking SORCE
performance, highlighting the potential of multi-embedding representation and
text-customized MLLM features for addressing this task.
Authors' comments: Project Page: https://github.com/MCG-NJU/SORCE