Kevin Dela Rosa
We introduce ModaRoute, an LLM-based intelligent routing system that dynamically selects optimal modalities for multimodal video retrieval. While dense text captions can achieve 75.9% Recall@5, they require expensive offline processing and miss critical visual information present in 34% of clips with scene text not captured by ASR. By analyzing query intent and predicting information needs, ModaRoute reduces computational overhead by 41% while achieving 60.9% Recall@5. Our approach uses GPT-4.1 to route queries across ASR (speech), OCR (text), and visual indices, averaging 1.78 modalities per query versus exhaustive 3.0 modality search. Evaluation on 1.8M video clips demonstrates that intelligent routing provides a practical solution for scaling multimodal retrieval systems, reducing infrastructure costs while maintaining competitive effectiveness for real-world deployment.
Authors' comments: Accepted to ICCV 2025 Multimodal Representation and Retrieval Workshop
Jack McKechnie, Graham McDonald, Craig Macdonald
The evaluation of Information Retrieval (IR) systems typically uses query-document pairs with corresponding human-labelled relevance assessments (qrels). These qrels are used to determine if one system is better than another based on average retrieval performance. Acquiring large volumes of human relevance assessments is expensive. Therefore, more efficient relevance assessment approaches have been proposed, necessitating comparisons between qrels to ascertain their efficacy. Discriminative power, i.e. the ability to correctly identify significant differences between systems, is important for drawing accurate conclusions on the robustness of qrels. Previous work has measured the proportion of pairs of systems that are identified as significantly different and has quantified Type I statistical errors. Type I errors lead to incorrect conclusions due to false positive significance tests. We argue that also identifying Type II errors (false negatives) is important as they lead science in the wrong direction. We quantify Type II errors and propose that balanced classification metrics, such as balanced accuracy, can be used to portray the discriminative power of qrels. We perform experiments using qrels generated using alternative relevance assessment methods to investigate measuring hypothesis testing errors in IR evaluation. We find that additional insights into the discriminative power of qrels can be gained by quantifying Type II errors, and that balanced classification metrics can be used to give an overall summary of discriminative power in one, easily comparable, number.
Anant Gupta, Rajarshi Bhowmik, Geoffrey Gunow
Tracking the strategic focus of companies through topics in their earnings
calls is a key task in financial analysis. However, as industries evolve,
traditional topic modeling techniques struggle to dynamically capture emerging
topics and their relationships. In this work, we propose an LLM-agent driven
approach to discover and retrieve emerging topics from quarterly earnings
calls. We propose an LLM-agent to extract topics from documents, structure them
into a hierarchical ontology, and establish relationships between new and
existing topics through a topic ontology. We demonstrate the use of extracted
topics to infer company-level insights and emerging trends over time. We
evaluate our approach by measuring ontology coherence, topic evolution
accuracy, and its ability to surface emerging financial trends.
Authors' comments: The 2nd Workshop on Financial Information Retrieval in the Era of
Generative AI, The 48th International ACM SIGIR Conference on Research and
Development in Information Retrieval July 13-17, 2025 | Padua, Italy
Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma
We consider the problem of answering complex questions, given access to a
large unstructured document corpus. The de facto approach to solving the
problem is to leverage language models that (iteratively) retrieve and reason
through the retrieved documents, until the model has sufficient information to
generate an answer. Attempts at improving this approach focus on
retrieval-augmented generation (RAG) metrics such as accuracy and recall and
can be categorized into two types: (a) fine-tuning on large question answering
(QA) datasets augmented with chain-of-thought traces, and (b) leveraging
RL-based fine-tuning techniques that rely on question-document relevance
signals. However, efficiency in the number of retrieval searches is an equally
important metric, which has received less attention. In this work, we show
that: (1) Large-scale fine-tuning is not needed to improve RAG metrics,
contrary to popular claims in recent literature. Specifically, a standard ReAct
pipeline with improved prompts can outperform state-of-the-art methods on
benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help
RAG from the perspective of frugality, i.e., the latency due to number of
searches at inference time. For example, we show that we can achieve
competitive RAG metrics at nearly half the cost (in terms of number of
searches) on popular RAG benchmarks, using the same base model, and at a small
training cost (1000 examples).
Authors' comments: Accepted at ICML Workshop: Efficient Systems for Foundation Models
Dahyun Lee, Yongrae Jo, Haeju Park, Moontae Lee
Retrieval in Retrieval-Augmented Generation(RAG) must ensure that retrieved
passages are not only individually relevant but also collectively form a
comprehensive set. Existing approaches primarily rerank top-k passages based on
their individual relevance, often failing to meet the information needs of
complex queries in multi-hop question answering. In this work, we propose a
set-wise passage selection approach and introduce SETR, which explicitly
identifies the information requirements of a query through Chain-of-Thought
reasoning and selects an optimal set of passages that collectively satisfy
those requirements. Experiments on multi-hop RAG benchmarks show that SETR
outperforms both proprietary LLM-based rerankers and open-source baselines in
terms of answer correctness and retrieval quality, providing an effective and
efficient alternative to traditional rerankers in RAG systems. The code is
available at https://github.com/LGAI-Research/SetR
Authors' comments: Accepted to ACL 2025 Oral
SeungYoon Han, Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, Huije Lee, Jong C. Park
The rapid expansion of digital information and knowledge across structured and unstructured sources has heightened the importance of Information Retrieval (IR). While dense retrieval methods have substantially improved semantic matching for general queries, they consistently underperform on queries with explicit temporal constraints--often those containing numerical expressions and time specifiers such as ``in 2015.'' Existing approaches to Temporal Information Retrieval (TIR) improve temporal reasoning but often suffer from catastrophic forgetting, leading to reduced performance on non-temporal queries. To address this, we propose Time-Specifier Model Merging (TSM), a novel method that enhances temporal retrieval while preserving accuracy on non-temporal queries. TSM trains specialized retrievers for individual time specifiers and merges them in to a unified model, enabling precise handling of temporal constraints without compromising non-temporal retrieval. Extensive experiments on both temporal and non-temporal datasets demonstrate that TSM significantly improves performance on temporally constrained queries while maintaining strong results on non-temporal queries, consistently outperforming other baseline methods. Our code is available at https://github.com/seungyoonee/TSM .
Tianzhe Zhao, Jiaoyan Chen, Yanchi Ru, Haiping Zhu, Nan Hu, Jun Liu, Qika Lin
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by retrieving external data to mitigate hallucinations and outdated knowledge issues. Benefiting from the strong ability in facilitating diverse data sources and supporting faithful reasoning, knowledge graphs (KGs) have been increasingly adopted in RAG systems, giving rise to KG-based RAG (KG-RAG) methods. Though RAG systems are widely applied in various applications, recent studies have also revealed its vulnerabilities to data poisoning attacks, where malicious information injected into external knowledge sources can mislead the system into producing incorrect or harmful responses. However, these studies focus exclusively on RAG systems using unstructured textual data sources, leaving the security risks of KG-RAG largely unexplored, despite the fact that KGs present unique vulnerabilities due to their structured and editable nature. In this work, we conduct the first systematic investigation of the security issue of KG-RAG methods through data poisoning attacks. To this end, we introduce a practical, stealthy attack setting that aligns with real-world implementation. We propose an attack strategy that first identifies adversarial target answers and then inserts perturbation triples to complete misleading inference chains in the KG, increasing the likelihood that KG-RAG methods retrieve and rely on these perturbations during generation. Through extensive experiments on two benchmarks and four recent KG-RAG methods, our attack strategy demonstrates strong effectiveness in degrading KG-RAG performance, even with minimal KG perturbations. In-depth analyses are also conducted to understand the safety threats within the internal stages of KG-RAG systems and to explore the robustness of LLMs against adversarial knowledge.
Authors' comments: 13 pages, 6 figures
Alessandro Umbrico, Guido Bologna, Luca Coraci, Francesca Fracasso, Silvia Gola, Gabriella Cortellessa
The lack of transparency of data-driven Artificial Intelligence techniques limits their interpretability and acceptance into healthcare decision-making processes. We propose an attribution-based approach to improve the interpretability of Explainable AI-based predictions in the specific context of arm lymphedema's risk assessment after lymph nodal radiotherapy in breast cancer. The proposed method performs a statistical analysis of the attributes in the rule-based prediction model using standard metrics from Information Retrieval techniques. This analysis computes the relevance of each attribute to the prediction and provides users with interpretable information about the impact of risk factors. The results of a user study that compared the output generated by the proposed approach with the raw output of the Explainable AI model suggested higher levels of interpretability and usefulness in the context of predicting lymphedema risk.
Zeyuan Meng, Zixuan Yi, Iadh Ounis
Large Language Models (LLMs) have shown strong potential in recommender systems due to their contextual learning and generalisation capabilities. Existing LLM-based recommendation approaches typically formulate the recommendation task using specialised prompts designed to leverage their contextual abilities, and aligning their outputs closely with human preferences to yield an improved recommendation performance. However, the use of LLMs for recommendation tasks is limited by the absence of domain-specific knowledge. This lack of relevant relational knowledge about the items to be recommended in the LLM's pre-training corpus can lead to inaccuracies or hallucinations, resulting in incorrect or misleading recommendations. Moreover, directly using information from the knowledge graph introduces redundant and noisy information, which can affect the LLM's reasoning process or exceed its input context length, thereby reducing the performance of LLM-based recommendations. To address the lack of domain-specific knowledge, we propose a novel model called Knowledge-Enhanced Retrieval-Augmented Generation for Recommendation (KERAG_R). Specifically, we leverage a graph retrieval-augmented generation (GraphRAG) component to integrate additional information from a knowledge graph (KG) into instructions, enabling the LLM to collaboratively exploit recommendation signals from both text-based user interactions and the knowledge graph to better estimate the users' preferences in a recommendation context. In particular, we perform graph RAG by pre-training a graph attention network (GAT) to select the most relevant triple for the target users for the used LLM, thereby enhancing the LLM while reducing redundant and noisy information. Our extensive experiments on three public datasets show that our proposed KERAG_R model significantly outperforms ten existing state-of-the-art recommendation methods.
YiHan Jiao, ZheHao Tan, Dan Yang, DuoLin Sun, Jie Feng, Jian Wang, Peng Wei
Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often \textit{lack a granular focus on RAG task} or \textit{a deeper utilization of chain-of-thought processes}. To address this, we propose that RAG models should possess three progressively hierarchical abilities (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG) incorporates a "think before answering" strategy. This method enhances the model's open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model's performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.
Yiqiao Jin, Kartik Sharma, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar
Retrieval-augmented Generation (RAG) extends large language models (LLMs)
with external knowledge but faces key challenges: restricted effective context
length and redundancy in retrieved documents. Pure compression-based approaches
reduce input size but often discard fine-grained details essential for factual
accuracy. We propose SARA, a unified RAG framework that balances local
precision and global knowledge coverage under tight context budgets. SARA
combines natural-language text snippets with semantic compression vectors to
jointly enhance context efficiency and answer correctness. It represents
contexts at two complementary levels: 1) fine-grained natural-language spans
that preserve critical entities and numerical values, and 2) compact,
interpretable vectors that summarize high-level semantics. An iterative
evidence-selection module employs the compression vectors for dynamic reranking
of contexts. Across 9 datasets and 5 open-source LLMs spanning 3 model families
(Mistral, Llama, and Gemma), SARA consistently improves answer relevance
(+17.71), answer correctness (+13.72), and semantic similarity (+15.53),
demonstrating the importance of integrating textual and compressed
representations for robust, context-efficient RAG.
Authors' comments: 20 pages
Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, Liqiang Nie
Composed Image Retrieval (CIR) represents a novel retrieval paradigm that is capable of expressing users' intricate retrieval requirements flexibly. It enables the user to give a multimodal query, comprising a reference image and a modification text, and subsequently retrieve the target image. Notwithstanding the considerable advances made by prevailing methodologies, CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation, and 2) the priority of textual data in the image modification process is overlooked, which leads to a visual focus bias. To address these two limitations, this work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping. It is designed to identify significant dominant portions in images and guide the extraction of visual and textual data features, thereby reducing the impact of noise interference. Subsequently, we propose a textually guided focus revision module, which can utilize the modification requirements implied in the text to perform adaptive focus revision on the reference image, thereby enhancing the perception of the modification focus on the composed features. The aforementioned modules collectively constitute the segmentatiOn-based Focus shiFt reviSion nETwork (\mbox{OFFSET}), and comprehensive experiments on four benchmark datasets substantiate the superiority of our proposed method. The codes and data are available on https://zivchen-ty.github.io/OFFSET.github.io/
Mengyao Xu, Gabriel Moreira, Ronay Ak, Radek Osmulski, Yauhen Babakhin, Zhiding Yu, Benedikt Schifferer, Even Oldridge
Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state of the art performance, scoring NDCG@5 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 Vision-Language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.
Hengran Zhang, Keping Bi, Jiafeng Guo
While large language models (LLMs) are increasingly deployed as dense
retrievers, the impact of their domain-specific specialization on retrieval
effectiveness remains underexplored. This investigation systematically examines
how task-specific adaptations in LLMs influence their retrieval capabilities,
an essential step toward developing unified retrievers capable of handling
text, code, images, and multimodal content. We conduct extensive experiments
with eight Qwen2.5 7B LLMs, including base, instruction-tuned,
code/math-specialized, long reasoning, and vision-language models across
zero-shot retrieval settings and the supervised setting. For the zero-shot
retrieval settings, we consider text retrieval from the BEIR benchmark and code
retrieval from the CoIR benchmark. Further, to evaluate supervised performance,
all LLMs are fine-tuned on the MS MARCO dataset. We find that mathematical
specialization and the long reasoning capability cause consistent degradation
in three settings, indicating conflicts between mathematical reasoning and
semantic matching. The vision-language model and code-specialized LLMs
demonstrate superior zero-shot performance compared to other LLMs, even
surpassing BM25 on the code retrieval task, and maintain comparable performance
to base LLMs in supervised settings. These findings suggest promising
directions for the unified retrieval task leveraging cross-domain and
cross-modal fusion.
Authors' comments: Accepted by CCIR25 and published by Springer LNCS or LNAI
Ikuya Yamada, Ryokan Ri, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Dense retrievers often struggle with queries involving less-frequent entities due to their limited entity knowledge. We propose the Knowledgeable Passage Retriever (KPR), a BERT-based retriever enhanced with a context-entity attention layer and dynamically updatable entity embeddings. This design enables KPR to incorporate external entity knowledge without retraining. Experiments on three datasets show that KPR consistently improves retrieval accuracy, achieving a substantial 12.6% gain on the EntityQuestions dataset over the model without KPR extensions. When built on the off-the-shelf bge-base retriever, KPR achieves state-of-the-art performance among similarly sized models on two datasets. Code and models will be released soon.
Lea Fischbach, Akbar Karimi, Caroline Kleen, Alfred Lameli, Lucie Flek
Deep learning models for dialect identification are often limited by the
scarcity of dialectal data. To address this challenge, we propose to use
Retrieval-based Voice Conversion (RVC) as an effective data augmentation method
for a low-resource German dialect classification task. By converting audio
samples to a uniform target speaker, RVC minimizes speaker-related variability,
enabling models to focus on dialect-specific linguistic and phonetic features.
Our experiments demonstrate that RVC enhances classification performance when
utilized as a standalone augmentation method. Furthermore, combining RVC with
other augmentation methods such as frequency masking and segment removal leads
to additional performance gains, highlighting its potential for improving
dialect classification in low-resource scenarios.
Authors' comments: Accepted to Interspeech 2025
Jingyu Pan, Isaac Jacobson, Zheng Zhao, Tung-Chieh Chen, Guanglei Zhou, Chen-Chia Chang, Vineet Rashingkar, Yiran Chen
Modern very large-scale integration (VLSI) design requires the implementation
of integrated circuits using electronic design automation (EDA) tools. Due to
the complexity of EDA algorithms, the vast parameter space poses a huge
challenge to chip design optimization, as the combination of even moderate
numbers of parameters creates an enormous solution space to explore. Manual
parameter selection remains industrial practice despite being excessively
laborious and limited by expert experience. To address this issue, we present
CROP, the first large language model (LLM)-powered automatic VLSI design flow
tuning framework. Our approach includes: (1) a scalable methodology for
transforming RTL source code into dense vector representations, (2) an
embedding-based retrieval system for matching designs with semantically similar
circuits, and (3) a retrieval-augmented generation (RAG)-enhanced LLM-guided
parameter search system that constrains the search process with prior knowledge
from similar designs. Experiment results demonstrate CROP's ability to achieve
superior quality-of-results (QoR) with fewer iterations than existing
approaches on industrial designs, including a 9.9% reduction in power
consumption.
Authors' comments: Accepted by ICCAD 2025
Xinxi Lyu, Michael Duan, Rulin Shao, Pang Wei Koh, Sewon Min
Retrieval-augmented Generation (RAG) has primarily been studied in limited
settings, such as factoid question answering; more challenging,
reasoning-intensive benchmarks have seen limited success from minimal RAG. In
this work, we challenge this prevailing view on established,
reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We
identify a key missing component in prior work: a usable, web-scale datastore
aligned with the breadth of pretraining data. To this end, we introduce
CompactDS: a diverse, high-quality, web-scale datastore that achieves high
retrieval accuracy and subsecond latency on a single-node. The key insights are
(1) most web content can be filtered out without sacrificing coverage, and a
compact, high-quality subset is sufficient; and (2) combining in-memory
approximate nearest neighbor (ANN) retrieval and on-disk exact search balances
speed and recall. Using CompactDS, we show that a minimal RAG pipeline achieves
consistent accuracy improvements across all benchmarks and model sizes
(8B--70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA,
and 19% on MATH. No single data source suffices alone, highlighting the
importance of diversity of sources (web crawls, curated math, academic papers,
textbooks). Finally, we show that our carefully designed in-house datastore
matches or outperforms web search engines such as Google Search, as well as
recently proposed, complex agent-based RAG systems--all while maintaining
simplicity, reproducibility, and self-containment. We release CompactDS and our
retrieval pipeline, supporting future research exploring retrieval-based AI
systems.
Authors' comments: 33 pages, 2 figures, 27 tables
Thanh-Tung Phan-Nguyen, Khoi-Nguyen Nguyen-Ngoc, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
The global fashion e-commerce industry has become integral to people's daily lives, leveraging technological advancements to offer personalized shopping experiences, primarily through recommendation systems that enhance customer engagement through personalized suggestions. To improve customers' experience in online shopping, we propose a novel comprehensive KiseKloset system for outfit retrieval, recommendation, and try-on. We explore two approaches for outfit retrieval: similar item retrieval and text feedback-guided item retrieval. Notably, we introduce a novel transformer architecture designed to recommend complementary items from diverse categories. Furthermore, we enhance the overall performance of the search pipeline by integrating approximate algorithms to optimize the search process. Additionally, addressing the crucial needs of online shoppers, we employ a lightweight yet efficient virtual try-on framework capable of real-time operation, memory efficiency, and maintaining realistic outputs compared to its predecessors. This virtual try-on module empowers users to visualize specific garments on themselves, enhancing the customers' experience and reducing costs associated with damaged items for retailers. We deployed our end-to-end system for online users to test and provide feedback, enabling us to measure their satisfaction levels. The results of our user study revealed that 84% of participants found our comprehensive system highly useful, significantly improving their online shopping experience.
Sophie Zhou, Shu Kong
Art plagiarism detection plays a crucial role in protecting artists'
copyrights and intellectual property, yet it remains a challenging problem in
forensic analysis. In this paper, we address the task of recognizing
plagiarized paintings and explaining the detected plagarisms by retrieving
visually similar authentic artworks. To support this study, we construct a
dataset by collecting painting photos and synthesizing plagiarized versions
using generative AI, tailored to specific artists' styles. We first establish a
baseline approach using off-the-shelf features from the visual foundation model
DINOv2 to retrieve the most similar images in the database and classify
plagiarism based on a similarity threshold. Surprisingly, this non-learned
method achieves a high recognition accuracy of 97.2\% but suffers from low
retrieval precision 29.0\% average precision (AP). To improve retrieval
quality, we finetune DINOv2 with a metric learning loss using positive and
negative sample pairs sampled in the database. The finetuned model greatly
improves retrieval performance by 12\% AP over the baseline, though it
unexpectedly results in a lower recognition accuracy (92.7\%). We conclude with
insightful discussions and outline directions for future research.
Authors' comments: to appear at AVSS'25