Zhibo Fan, Hongtao Lin, Haoyu Chen, Bowen Deng, Hedi Xia, Yuke Yan, James Li
Industrial recommendation systems are typically composed of multiple stages,
including retrieval, ranking, and blending. The retrieval stage plays a
critical role in generating a high-recall set of candidate items that covers a
wide range of diverse user interests. Effectively covering the diverse and
long-tail user interests within this stage poses a significant challenge:
traditional two-tower models struggle in this regard due to limited user-item
feature interaction and often bias towards top use cases. To address these
issues, we propose a novel multi-embedding retrieval framework designed to
enhance user interest representation by generating multiple user embeddings
conditioned on both implicit and explicit user interests. Implicit interests
are captured from user history through a Differentiable Clustering Module
(DCM), whereas explicit interests, such as topics that the user has followed,
are modeled via Conditional Retrieval (CR). These methodologies represent a
form of conditioned user representation learning that involves condition
representation construction and associating the target item with the relevant
conditions. Combining implicit and explicit user interests yields more
effective and comprehensive candidate retrieval, as the two approaches benefit
different user segments and extract conditions from different but
complementary sources. Extensive experiments and A/B testing reveal
significant improvements in user engagement and feed diversity metrics.
Our proposed framework has been successfully deployed on Pinterest home feed.
Authors' comments: KDD 2025
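As a rough illustration of retrieving with several user embeddings at once, the sketch below scores items against each user embedding and merges the per-embedding top-k lists. The function names, the toy vectors, and the "keep the best score" merge rule are assumptions made for illustration; the abstract does not describe how candidates from different embeddings are merged.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def multi_embedding_retrieve(user_embeddings, item_embeddings, k):
    """Union of per-embedding top-k items, keeping each item's best score.

    Hypothetical merge rule: an item retrieved by several user embeddings
    is ranked by the highest score any of them gave it.
    """
    best = {}
    for u in user_embeddings:
        scored = sorted(
            item_embeddings.items(),
            key=lambda kv: dot(u, kv[1]),
            reverse=True,
        )
        for item_id, vec in scored[:k]:
            best[item_id] = max(best.get(item_id, float("-inf")), dot(u, vec))
    return sorted(best, key=lambda i: best[i], reverse=True)


# Toy data: two interest-conditioned user embeddings, three items.
items = {"cooking": [1.0, 0.0], "travel": [0.0, 1.0], "mixed": [0.6, 0.6]}
user_embeddings = [[1.0, 0.0], [0.0, 1.0]]
candidates = multi_embedding_retrieve(user_embeddings, items, k=1)
```

With a single averaged user embedding, the "mixed" item would win the only slot; with one embedding per interest, both specialised items are retrieved.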
Yongsheng Lian
We present Machine Assistant with Reliable Knowledge (MARK), a retrieval-augmented question-answering system designed to support student learning through accurate and contextually grounded responses. The system is built on a retrieval-augmented generation (RAG) framework, which integrates a curated knowledge base to ensure factual consistency. To enhance retrieval effectiveness across diverse question types, we implement a hybrid search strategy that combines dense vector similarity with sparse keyword-based retrieval. This dual-retrieval mechanism improves robustness for both general and domain-specific queries. The system includes a feedback loop in which students can rate responses and instructors can review and revise them. Instructor corrections are incorporated into the retrieval corpus, enabling adaptive refinement over time. The system was deployed in a classroom setting as a substitute for traditional office hours, where it successfully addressed a broad range of student queries. It was also used to provide technical support by integrating with a customer-specific knowledge base, demonstrating its ability to handle routine, context-sensitive tasks in applied domains. MARK is publicly accessible at https://app.eduquery.ai.
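The abstract above does not say how the dense and sparse result lists are combined. One common option is reciprocal rank fusion (RRF), sketched below as a stand-in and not as MARK's actual method. The document ids and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents are returned by descending fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical result lists from a dense retriever and a keyword retriever.
dense = ["d2", "d1", "d3"]
sparse = ["d1", "d4", "d2"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents that appear near the top of both lists ("d1", "d2") rise above documents found by only one retriever.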
Mai A. Shaaban, Tausifa Jan Saleem, Vijay Ram Papineni, Mohammad Yaqub
Medical visual question answering (MedVQA) plays a vital role in clinical decision-making by providing contextually rich answers to image-based queries. Although vision-language models (VLMs) are widely used for this task, they often generate factually incorrect answers. Retrieval-augmented generation addresses this challenge by providing information from external sources, but risks retrieving irrelevant context, which can degrade the reasoning capabilities of VLMs. Re-ranking retrievals, as introduced in existing approaches, enhances retrieval relevance by focusing on query-text alignment. However, these approaches neglect the visual or multimodal context, which is particularly crucial for medical diagnosis. We propose MOTOR, a novel multimodal retrieval and re-ranking approach that leverages grounded captions and optimal transport. It captures the underlying relationships between the query and the retrieved context based on textual and visual information. Consequently, our approach identifies more clinically relevant contexts to augment the VLM input. Empirical analysis and human expert evaluation demonstrate that MOTOR achieves higher accuracy on MedVQA datasets, outperforming state-of-the-art methods by an average of 6.45%. Code is available at https://github.com/BioMedIA-MBZUAI/MOTOR.
Chase Fensore, Kaustubh Dhole, Joyce C Ho, Eugene Agichtein
We present our submission to the LiveRAG Challenge 2025, which evaluates
retrieval-augmented generation (RAG) systems on dynamic test sets using the
FineWeb-10BT corpus. Our final hybrid approach combines sparse (BM25) and dense
(E5) retrieval methods and then aims to generate relevant and faithful answers
with Falcon3-10B-Instruct. Through systematic evaluation on 200 synthetic
questions generated with DataMorgana across 64 unique question-user
combinations, we demonstrate that neural re-ranking with RankLLaMA improves MAP
from 0.523 to 0.797 (52% relative improvement) but introduces prohibitive
computational costs (84s vs 1.74s per question). While DSPy-optimized prompting
strategies achieved higher semantic similarity (0.771 vs 0.668), their 0%
refusal rates raised concerns about over-confidence and generalizability. Our
submitted hybrid system without re-ranking achieved 4th place in faithfulness
and 11th place in correctness among 25 teams. Analysis across question
categories reveals that vocabulary alignment between questions and documents
was the strongest predictor of performance on our development set, with
document-similar phrasing improving cosine similarity from 0.562 to 0.762.
Authors' comments: 4 pages, 3 tables, 2 figures. Accepted at the SIGIR LiveRAG Workshop
2025 (Submission 2664)
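The re-ranking trade-off quoted in the abstract above can be checked with a few lines of arithmetic. The helper name is an assumption; the numbers are the ones reported in the abstract.

```python
def relative_improvement(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# MAP without and with RankLLaMA re-ranking, as reported in the abstract.
map_gain = relative_improvement(0.523, 0.797)

# Seconds per question with and without re-ranking, as reported.
slowdown = 84 / 1.74
```

The MAP gain works out to roughly 52%, matching the abstract, while the per-question cost grows by a factor of about 48.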
Weronika Łajewska, Ivica Kostric, Gabriel Iturra-Bocaz, Mariam Arustashvili, Krisztian Balog
Retrieval-augmented generation (RAG) faces challenges related to factual correctness, source attribution, and response completeness. The LiveRAG Challenge hosted at SIGIR'25 aims to advance RAG research using a fixed corpus and a shared, open-source LLM. We propose a modular pipeline that operates on information nuggets: minimal, atomic units of relevant information extracted from retrieved documents. This multistage pipeline encompasses query rewriting, passage retrieval and reranking, nugget detection and clustering, cluster ranking and summarization, and response fluency enhancement. This design inherently promotes grounding in specific facts, facilitates source attribution, and ensures maximum information inclusion within length constraints. In this challenge, we extend our focus to also address the retrieval component of RAG, building upon our prior work on multi-faceted query rewriting. Furthermore, for augmented generation, we concentrate on improving context curation capabilities, maximizing the breadth of information covered in the response while ensuring pipeline efficiency. Our results show that combining original queries with a few sub-query rewrites boosts recall, while increasing the number of documents used for reranking and generation beyond a certain point reduces effectiveness, without improving response quality.
Iliass Ayaou, Denis Cavallucci, Hicham Chibane
In the landscape of publicly available patent retrieval datasets, the need for explicit in-domain and out-of-domain labeling, multi-jurisdiction coverage, balanced query domain representation, and manageable sizes that support sub-document-level experiments on moderate computational resources is often overlooked. To address these gaps, we propose DAPFAM, a new open-access domain-aware patent retrieval dataset constructed at the simple-family level. The dataset contains 1,247 domain-balanced full-text query families and 45,336 full-text target families. The dataset is enriched by clear relevance judgments (forward/backward citations as positive links, random negatives), as well as explicit in-domain or out-of-domain relationships via a novel labelling scheme based on International Patent Classification (IPC) codes, resulting in 49,869 evaluation pairs. The dataset is multi-jurisdictional, requires little to no preprocessing for retrieval evaluation, and remains of a size manageable for entities with limited resources, allowing for sub-document-level retrieval experiments without excessive computational costs. We describe our three-step data-curation pipeline, present comprehensive dataset statistics, and provide baseline experiments using lexical and neural retrieval methods. Our baseline experiments highlight significant challenges in cross-domain patent retrieval. The dataset will be publicly available (for now the access link is this repository: https://osf.io/vbyzd/?view_only=1a40242e0d1941a58aa854af3e50cf6b).
Najmeh Forouzandehmehr, Reza Yousefi Maragheh, Sriram Kollipara, Kai Zhao, Topojoy Biswas, Evren Korpeoglu, Kannan Achan
Automated content-aware layout generation -- the task of arranging visual elements such as text, logos, and underlays on a background canvas -- remains a fundamental yet under-explored problem in intelligent design systems. While recent advances in deep generative models and large language models (LLMs) have shown promise in structured content generation, most existing approaches lack grounding in contextual design exemplars and fall short in handling semantic alignment and visual coherence. In this work we introduce CAL-RAG, a retrieval-augmented, agentic framework for content-aware layout generation that integrates multimodal retrieval, large language models, and collaborative agentic reasoning. Our system retrieves relevant layout examples from a structured knowledge base and invokes an LLM-based layout recommender to propose structured element placements. A vision-language grader agent evaluates the layout with visual metrics, and a feedback agent provides targeted refinements, enabling iterative improvement. We implement our framework using LangGraph and evaluate it on the PKU PosterLayout dataset, a benchmark rich in semantic and structural variability. CAL-RAG achieves state-of-the-art performance across multiple layout metrics -- including underlay effectiveness, element alignment, and overlap -- substantially outperforming strong baselines such as LayoutPrompter. These results demonstrate that combining retrieval augmentation with agentic multi-step reasoning yields a scalable, interpretable, and high-fidelity solution for automated layout generation.
Thuy T. Le, Phuong M. Nguyen, Loc H. Nguyen
This paper addresses the challenging and interesting inverse problem of reconstructing the spatially varying dielectric constant of a medium from phaseless backscattering measurements generated by single-point illumination. The underlying mathematical model is governed by the three-dimensional Helmholtz equation, and the available data consist solely of the magnitude of the scattered wave field. To address the nonlinearity and severe ill-posedness of this phaseless inverse scattering problem, we introduce a robust, globally convergent numerical framework combining several key regularization strategies. Our method first employs a phase retrieval step based on the Wentzel--Kramers--Brillouin (WKB) ansatz, where the lost phase information is reconstructed by solving a nonlinear optimization problem. Subsequently, we implement a Fourier-based dimension reduction technique, transforming the original problem into a more stable system of elliptic equations with Cauchy boundary conditions. To solve this resulting system reliably, we apply the Carleman convexification approach, constructing a strictly convex weighted cost functional whose global minimizer provides an accurate approximation of the true solution. Numerical simulations using synthetic data with high noise levels demonstrate the effectiveness and robustness of the proposed method, confirming its capability to accurately recover both the geometric location and contrast of hidden scatterers.
Yongchan Chun, Minhyuk Kim, Dongjun Kim, Chanjun Park, Heuiseok Lim
Automatic Term Extraction (ATE) identifies domain-specific expressions that are crucial for downstream tasks such as machine translation and information retrieval. Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for ATE has scarcely been examined. We propose a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. This syntactic retrieval method is domain-agnostic and provides more reliable guidance for capturing term boundaries. We evaluate the approach in both in-domain and cross-domain settings, analyzing how lexical overlap between the query sentence and its retrieved examples affects performance. Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. These findings highlight the importance of syntactic cues when adapting LLMs to terminology-extraction tasks.
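To make the idea of selecting demonstrations by syntactic similarity concrete, here is a minimal sketch that compares part-of-speech tag sequences with Python's `difflib`. Using POS-tag sequences and `SequenceMatcher` as the similarity measure is an assumption for illustration, since the abstract does not define the measure. The example pool and its tags are hypothetical and hand-written.

```python
from difflib import SequenceMatcher


def syntactic_similarity(tags_a, tags_b):
    """Similarity in [0, 1] between two POS-tag sequences."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def select_demonstrations(query_tags, pool, n):
    """Return the n pool examples whose tag sequences best match the query."""
    ranked = sorted(
        pool,
        key=lambda ex: syntactic_similarity(query_tags, ex["tags"]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical, pre-tagged demonstration pool from an unrelated domain.
pool = [
    {"text": "The enzyme catalyses hydrolysis.",
     "tags": ["DET", "NOUN", "VERB", "NOUN"]},
    {"text": "Rapidly rising inflation hurts.",
     "tags": ["ADV", "VERB", "NOUN", "VERB"]},
]
# Tags for a query such as "The court dismissed appeals."
query_tags = ["DET", "NOUN", "VERB", "NOUN"]
demos = select_demonstrations(query_tags, pool, n=1)
```

The selected demonstration shares no vocabulary with the query, only its syntactic shape, which is what makes this style of retrieval domain-agnostic.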
Qinwen Chen, Wenbiao Tao, Zhiwei Zhu, Mingfan Xi, Liangzhong Guo, Yuan Wang, Wei Wang, Yunshi Lan
Community Question Answering (CQA) platforms can be regarded as important
community knowledge bases, but effectively leveraging historical
interactions and domain knowledge in real time remains a challenge. Existing
methods often underutilize external knowledge, fail to incorporate dynamic
historical QA context, or lack memory mechanisms suited for industrial
deployment. We propose ComRAG, a retrieval-augmented generation framework for
real-time industrial CQA that integrates static knowledge with dynamic
historical QA pairs via a centroid-based memory mechanism designed for
retrieval, generation, and efficient storage. Evaluated on three industrial CQA
datasets, ComRAG consistently outperforms all baselines--achieving up to 25.9%
improvement in vector similarity, reducing latency by 8.7% to 23.3%, and
lowering chunk growth from 20.23% to 2.06% over iterations.
Authors' comments: 7 pages, 4 figures. Accepted at ACL 2025 Industry Track
Naihe Feng, Yi Sui, Shiyi Hou, Jesse C. Cresswell, Ga Wu
Existing research on Retrieval-Augmented Generation (RAG) primarily focuses
on improving overall question-answering accuracy, often overlooking the quality
of sub-claims within generated responses. Recent methods that attempt to
improve RAG trustworthiness, such as through auto-evaluation metrics, lack
probabilistic guarantees or require ground truth answers. To address these
limitations, we propose Conformal-RAG, a novel framework inspired by recent
applications of conformal prediction (CP) on large language models (LLMs).
Conformal-RAG leverages CP and internal information from the RAG mechanism to
offer statistical guarantees on response quality. It ensures group-conditional
coverage spanning multiple sub-domains without requiring manual labelling of
conformal sets, making it suitable for complex RAG applications. Compared to
existing RAG auto-evaluation methods, Conformal-RAG offers statistical
guarantees on the quality of refined sub-claims, ensuring response reliability
without the need for ground truth answers. Additionally, our experiments
demonstrate that by leveraging information from the RAG system, Conformal-RAG
retains up to 60% more high-quality sub-claims from the response compared to
direct applications of CP to LLMs, while maintaining the same reliability
guarantee.
Authors' comments: Accepted by SIGIR 2025 short paper, 5 pages, Code is available at
https://github.com/n4feng/ResponseQualityAssessment
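For readers unfamiliar with conformal filtering of sub-claims, the sketch below shows a generic split-conformal calibration step: pick a score threshold on held-out responses, then keep only sub-claims scoring above it. This is a textbook-style illustration, not Conformal-RAG's actual procedure. The function names, the per-claim confidence scores, and the toy calibration data are assumptions.

```python
import math


def conformal_threshold(calibration, alpha):
    """Calibrate a score threshold from labelled calibration responses.

    `calibration` is a list of responses, each a list of
    (confidence_score, is_correct) pairs for its sub-claims. A response's
    nonconformity score is the highest confidence given to an incorrect
    sub-claim (minus infinity if it has none).
    """
    scores = sorted(
        max((s for s, ok in resp if not ok), default=float("-inf"))
        for resp in calibration
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal quantile index
    if k > n:
        return float("inf")  # too little data: retain nothing
    return scores[k - 1]


def filter_claims(claims, threshold):
    """Keep sub-claims whose confidence strictly exceeds the threshold."""
    return [text for text, score in claims if score > threshold]


calibration = [
    [(0.9, True), (0.4, False)],
    [(0.8, True), (0.7, False), (0.2, False)],
    [(0.95, True), (0.6, True)],
]
threshold = conformal_threshold(calibration, alpha=0.5)
kept = filter_claims([("a", 0.9), ("b", 0.4), ("c", 0.1)], threshold)
```

A better scoring function lowers the calibrated threshold, so more correct sub-claims survive at the same guarantee level; this is the lever the abstract refers to when it mentions retaining more high-quality sub-claims.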
Xingyu Deng, Xi Wang, Mark Stevenson
Scientific fact-checking aims to determine the veracity of scientific claims
by retrieving and analysing evidence from research literature. The problem is
inherently more complex than general fact-checking since it must accommodate
the evolving nature of scientific knowledge, the structural complexity of
academic literature and the challenges posed by long-form, multimodal
scientific expression. However, existing approaches focus on simplified
versions of the problem based on small-scale datasets consisting of abstracts
rather than full papers, thereby avoiding the distinct challenges associated
with processing complete documents. This paper examines the limitations of
current scientific fact-checking systems and reveals the many potential
features and resources that could be exploited to advance their performance. It
identifies key research challenges within evidence retrieval, including (1)
evidence-driven retrieval that addresses semantic limitations and topic
imbalance, (2) time-aware evidence retrieval with citation tracking to mitigate
outdated information, (3) structured document parsing to leverage long-range
context, (4) handling complex scientific expressions, including tables,
figures, and domain-specific terminology, and (5) assessing the credibility of
scientific literature. Preliminary experiments were conducted to substantiate
these challenges and identify potential solutions. This perspective paper aims
to advance scientific fact-checking with a specialised IR system tailored for
real-world applications.
Authors' comments: Accepted for ACM SIGIR Conference on Innovative Concepts and Theories
in Information Retrieval (ICTIR'25)
Chinmay Gondhalekar, Urjitkumar Patel, Fang-Chun Yeh
Financial documents--such as 10-Ks, 10-Qs, and investor presentations--span
hundreds of pages and combine diverse modalities, including dense narrative
text, structured tables, and complex figures. Answering questions over such
content often requires joint reasoning across modalities, which strains
traditional large language models (LLMs) and retrieval-augmented generation
(RAG) pipelines due to token limitations, layout loss, and fragmented
cross-modal context. We introduce MultiFinRAG, a retrieval-augmented generation
framework purpose-built for financial QA. MultiFinRAG first performs multimodal
extraction by grouping table and figure images into batches and sending them to
a lightweight, quantized open-source multimodal LLM, which produces both
structured JSON outputs and concise textual summaries. These outputs, along
with narrative text, are embedded and indexed with modality-aware similarity
thresholds for precise retrieval. A tiered fallback strategy then dynamically
escalates from text-only to text+table+image contexts when necessary, enabling
cross-modal reasoning while reducing irrelevant context. Despite running on
commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy
than ChatGPT-4o (free-tier) on complex financial QA tasks involving text,
tables, images, and combined multimodal reasoning.
Authors' comments: Preprint Copy
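The tiered fallback with modality-aware thresholds described above might look roughly like the following. The threshold values, the `min_hits` escalation rule, and the tuple layout are hypothetical; the abstract does not give the actual criteria for escalating.

```python
def tiered_retrieve(hits, thresholds, min_hits):
    """Return text-only context if it suffices, else add tables and images.

    `hits` is a list of (chunk_id, modality, similarity) tuples.
    `thresholds` maps each modality to its own minimum similarity.
    """
    def select(modalities):
        return [
            h for h in hits
            if h[1] in modalities and h[2] >= thresholds[h[1]]
        ]

    text_only = select({"text"})
    if len(text_only) >= min_hits:
        return text_only
    # Escalate: not enough confident text chunks, so widen the context.
    return select({"text", "table", "image"})


# Hypothetical per-modality thresholds and retrieval hits.
thresholds = {"text": 0.7, "table": 0.6, "image": 0.5}
hits = [
    ("t1", "text", 0.82),
    ("t2", "text", 0.55),
    ("tab1", "table", 0.66),
    ("img1", "image", 0.40),
]
escalated = [h[0] for h in tiered_retrieve(hits, thresholds, min_hits=2)]
text_tier = [h[0] for h in tiered_retrieve(hits, thresholds, min_hits=1)]
```

Keeping the text-only tier as the default keeps prompts short; tables and images are added only when the text evidence is too thin.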
Ali Tourani, Fatemeh Nazary, Yashar Deldjoo
This paper addresses the challenge of developing multimodal recommender
systems for the movie domain, where limited metadata (e.g., title, genre) often
hinders the generation of robust recommendations. We introduce a resource that
combines LLM-generated plot descriptions with trailer-derived visual embeddings
in a unified pipeline supporting both Retrieval-Augmented Generation (RAG) and
collaborative filtering. Central to our approach is a data augmentation step
that transforms sparse metadata into richer textual signals, alongside fusion
strategies (e.g., PCA, CCA) that integrate visual cues. Experimental
evaluations demonstrate that CCA-based fusion significantly boosts recall
compared to unimodal baselines, while an LLM-driven re-ranking step further
improves NDCG, particularly in scenarios with limited textual data. By
releasing this framework, we invite further exploration of multi-modal
recommendation techniques tailored to cold-start, novelty-focused, and
domain-specific settings. All code, data, and detailed documentation are
publicly available at: https://github.com/RecSys-lab/RAG-VisualRec
Authors' comments: 20 pages, 6 figures, 5 tables
Dhruv Gupta, Gayathri Ganesh Lakshmy, Yiqing Xie
Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant. Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation performance (e.g., by 4.88% Pass@1 on HumanEval).
KMA Solaiman, Bharat Bhargava
Existing multi-media retrieval models either rely on creating a common
subspace with modality-specific representation models or require schema mapping
among modalities to measure similarities among multi-media data. Our goal is to
avoid the annotation overhead incurred from considering retrieval as a
supervised classification task and re-use the pretrained encoders in large
language models and vision tasks. We propose "FemmIR", a framework to retrieve
multimodal results relevant to information needs expressed with multimodal
queries by example without any similarity label. Such identification is
necessary for real-world applications where data annotations are scarce and
satisfactory performance is required without fine-tuning with a common
framework across applications. We curate a new dataset called MuQNOL for
benchmarking progress on this task. Our technique is based on weak supervision
introduced through edit distance between samples: graph edit distance can be
modified to consider the cost of replacing a data sample in terms of its
properties, and relevance can be measured through the implicit signal from the
amount of edit cost among the objects. Unlike metric learning or encoding
networks, FemmIR re-uses the high-level properties and maintains the property
value and relationship constraints with a multi-level interaction score between
data samples and the query example provided by the user. We empirically
evaluate FemmIR on a missing person use case with MuQNOL. FemmIR performs
comparably to similar retrieval systems in delivering on-demand retrieval
results with exact and approximate similarities while using the existing
property identifiers in the system.
Authors' comments: Submitted to ICDE'24. An earlier version of this paper appeared on
TechRxiv: https://www.techrxiv.org/doi/full/10.36227/techrxiv.21990284.v1,
uploaded on February 05, 2023
Shenbin Qian, Diptesh Kanojia, Samarth Agrawal, Hadeel Saadany, Swapnil Bhosale, Constantin Orasan, Zhe Wu
E-commerce information retrieval (IR) systems struggle to simultaneously
achieve high accuracy in interpreting complex user queries and maintain
efficient processing of vast product catalogs. The dual challenge lies in
precisely matching user intent with relevant products while managing the
computational demands of real-time search across massive inventories. In this
paper, we propose a Nested Embedding Approach to product Retrieval and Ranking,
called NEAR$^2$, which can reduce embedding size at inference time by up to 12
times while introducing no extra cost in training and improving accuracy for
various encoder-based Transformer models.
We validate our approach using different loss functions for the retrieval and
ranking task, including multiple negative ranking loss and online contrastive
loss, on four different test sets with various IR challenges such as short and
implicit queries. Our approach achieves improved performance with a smaller
embedding dimension than existing models.
Authors' comments: This paper is accepted to the 2025 SIGIR Workshop on eCommerce
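A minimal sketch of how a nested embedding can be shrunk at inference time, assuming a Matryoshka-style layout in which the leading dimensions form a usable embedding on their own. That layout is an assumption, since the abstract does not spell out the mechanism. The 24-dimensional toy vector and the target size of 2 were chosen only to mirror a 12-fold reduction.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm else prefix


def cosine_of_unit_vectors(u, v):
    """Cosine similarity for vectors that are already unit length."""
    return sum(a * b for a, b in zip(u, v))


# Toy 24-dimensional embeddings; only the first two components matter here.
query_full = [3.0, 4.0] + [0.1] * 22
product_full = [4.0, 3.0] + [0.2] * 22

query_small = truncate_and_normalize(query_full, 2)
product_small = truncate_and_normalize(product_full, 2)
compression = len(query_full) // len(query_small)
similarity = cosine_of_unit_vectors(query_small, product_small)
```

Because the truncated vectors are re-normalized, the same dot-product index can serve them without any change to the scoring code.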
Ashish Chouhan, Michael Gertz
This paper presents the approach of our team called heiDS for the ArchEHR-QA
2025 shared task. A pipeline using a retrieval augmented generation (RAG)
framework is designed to generate answers that are attributed to clinical
evidence from the electronic health records (EHRs) of patients in response to
patient-specific questions. We explored various components of a RAG framework,
focusing on ranked list truncation (RLT) retrieval strategies and attribution
approaches. Instead of using a fixed top-k RLT retrieval strategy, we employ a
query-dependent-k retrieval strategy, including the existing surprise and
autocut methods and two new methods proposed in this work, autocut* and elbow.
The experimental results show the benefits of our strategy in producing factual
and relevant answers when compared to a fixed-k strategy.
Authors' comments: 12 pages, 2 figures, 6 tables, Workshop on BioNLP and Shared Tasks at
ACL 2025
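To illustrate what a query-dependent cutoff means in contrast to a fixed top-k, the sketch below truncates a ranked list at its largest score drop. This is a generic heuristic written for illustration; it is not the definition of the surprise, autocut, autocut*, or elbow methods named in the abstract, and the scores are made up.

```python
def largest_gap_cutoff(scores, min_k=1):
    """Choose k by cutting a descending score list at its largest drop.

    At least `min_k` results are always kept.
    """
    if len(scores) <= min_k:
        return len(scores)
    gaps = [
        scores[i] - scores[i + 1]
        for i in range(min_k - 1, len(scores) - 1)
    ]
    return min_k + gaps.index(max(gaps))


# Hypothetical retrieval scores for two different queries.
k_first = largest_gap_cutoff([0.91, 0.88, 0.86, 0.52, 0.50])
k_second = largest_gap_cutoff([0.90, 0.40, 0.39])
```

The first query keeps three sentences and the second keeps one, whereas a fixed k would pass either too much or too little evidence to the generator for at least one of them.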
Wanli Peng, Xin Chen, Hang Fu, XinYu He, Xue Yiming, Juan Wen
Recent advances in large language models (LLMs) have made a profound impact on our society and also raised new security concerns. Particularly, due to the remarkable inference ability of LLMs, the privacy violation attack (PVA), revealed by Staab et al., introduces serious personal privacy issues. Existing defense methods mainly leverage LLMs to anonymize the input query, which requires costly inference time and does not achieve satisfactory defense performance. Moreover, directly rejecting the PVA query may seem like an effective defense, but it exposes the defense method, promoting the evolution of PVA. In this paper, we propose a novel defense paradigm based on retrieval-confused generation (RCG) of LLMs, which can efficiently and covertly defend against the PVA. We first design a paraphrasing prompt to induce the LLM to rewrite the "user comments" of the attack query to construct a disturbed database. Then, we propose the most irrelevant retrieval strategy to retrieve the desired user data from the disturbed database. Finally, the "data comments" are replaced with the retrieved user data to form a defended query, so that the LLM responds to the adversary with wrong personal attributes, i.e., the attack fails. Extensive experiments are conducted on two datasets and eight popular LLMs to comprehensively evaluate the feasibility and the superiority of the proposed defense method.
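One plausible reading of "most irrelevant retrieval" is to pick the database entry least similar to the attack query. The sketch below implements that reading with cosine similarity. The similarity measure, the toy embedding vectors, and the entry names are assumptions and may differ from the paper's actual strategy.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_irrelevant(query_vec, database):
    """Return the entry whose embedding is least similar to the query."""
    return min(database, key=lambda entry: cosine(query_vec, entry["vec"]))


# Hypothetical disturbed database of paraphrased comments with toy embeddings.
disturbed_db = [
    {"text": "paraphrase A", "vec": [0.9, 0.1]},
    {"text": "paraphrase B", "vec": [0.1, 0.9]},
    {"text": "paraphrase C", "vec": [0.7, 0.7]},
]
attack_query_vec = [1.0, 0.0]
replacement = most_irrelevant(attack_query_vec, disturbed_db)
```

Substituting the least similar entry steers the model away from the user's real attributes while still returning a fluent answer, which is what keeps the defense covert.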
Xiao Cheng, Zhihao Guo, Huan Huo, Yulei Sui
Memory-related errors in C programming continue to pose significant challenges in software development, primarily due to the complexities of manual memory management inherent in the language. These errors frequently serve as vectors for severe vulnerabilities, while their repair requires extensive knowledge of program logic and C's memory model. Automated Program Repair (APR) has emerged as a critical research area to address these challenges. Traditional APR approaches rely on expert-designed strategies and predefined templates, which are labor-intensive and constrained by the effectiveness of manual specifications. Deep learning techniques offer a promising alternative by automatically extracting repair patterns, but they require substantial training datasets and often lack interpretability. This paper introduces LTFix, a novel approach that harnesses the potential of Large Language Models (LLMs) for automated memory error repair, especially for complex repository-level errors that span multiple functions and files. We address two fundamental challenges in LLM-based memory error repair: a limited understanding of interprocedural memory management patterns and context window limitations for repository-wide analysis. Our approach utilizes a finite typestate automaton to guide the tracking of error-propagation paths and context trace, capturing both spatial (memory states) and temporal (execution history) dimensions of error behavior. This typestate-guided context retrieval strategy provides the LLM with concise yet semantically rich information relevant to erroneous memory management, effectively addressing the token limitation of LLMs.
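A finite typestate automaton for heap memory can be written down in a few lines. The sketch below tracks one pointer through a sequence of events and stops at the first erroneous transition. The state names, event names, and transition table are a generic textbook model chosen for illustration, not LTFix's actual automaton.

```python
# (current state, event) -> next state for a single heap pointer.
TRANSITIONS = {
    ("unallocated", "malloc"): "allocated",
    ("allocated", "use"): "allocated",
    ("allocated", "free"): "freed",
    ("freed", "free"): "error:double-free",
    ("freed", "use"): "error:use-after-free",
}


def trace_typestate(events, start="unallocated"):
    """Return the sequence of states visited, stopping at the first error."""
    state = start
    trace = [state]
    for event in events:
        # Any transition missing from the table is treated as invalid.
        state = TRANSITIONS.get((state, event), "error:invalid-" + event)
        trace.append(state)
        if state.startswith("error"):
            break
    return trace


# Events collected along one hypothetical interprocedural path.
path = ["malloc", "use", "free", "use"]
trace = trace_typestate(path)
```

The state trace is the "spatial" part of the context and the event sequence is the "temporal" part; together they form a compact summary that can be handed to an LLM in place of whole source files.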