Runqi Sui
Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by integrating external knowledge bases. However, this integration introduces a new security threat: adversaries can exploit the retrieval mechanism to inject malicious content into the knowledge base, thereby influencing the generated responses. Based on this attack vector, we propose CtrlRAG, a novel attack method designed for RAG systems in the black-box setting, which aligns with real-world scenarios. Unlike existing attack methods, CtrlRAG introduces a perturbation mechanism using a Masked Language Model (MLM) to dynamically optimize malicious content in response to changes in the retrieved context. Experimental results demonstrate that CtrlRAG outperforms three baseline methods in both Emotional Manipulation and Hallucination Amplification objectives. Furthermore, we evaluate three existing defense mechanisms, revealing their limited effectiveness against CtrlRAG and underscoring the urgent need for more robust defenses.
Michael Green, Matan Levy, Issar Tzachor, Dvir Samuel, Nir Darshan, Rami Ben-Ari
We address the challenge of Small Object Image Retrieval (SoIR), where the goal is to retrieve images containing a specific small object, in a cluttered scene. The key challenge in this setting is constructing a single image descriptor, for scalable and efficient search, that effectively represents all objects in the image. In this paper, we first analyze the limitations of existing methods on this challenging task and then introduce new benchmarks to support SoIR evaluation. Next, we introduce Multi-object Attention Optimization (MaO), a novel retrieval framework which incorporates a dedicated multi-object pre-training phase. This is followed by a refinement process that leverages attention-based feature extraction with object masks, integrating them into a single unified image descriptor. Our MaO approach significantly outperforms existing retrieval methods and strong baselines, achieving notable improvements in both zero-shot and lightweight multi-object fine-tuning. We hope this work will lay the groundwork and inspire further research to enhance retrieval performance for this highly practical task.
Yifei Deng, Chenglong Li, Zhenyu Chen, Zihen Xu, Jin Tang
The performance of the traditional text-image person retrieval task is easily affected by lighting variations due to imaging limitations of visible spectrum sensors. In recent years, cross-modal information fusion has emerged as an effective strategy to enhance retrieval robustness. By integrating complementary information from different spectral modalities, it becomes possible to achieve more stable person recognition and matching under complex real-world conditions. Motivated by this, we introduce a novel task: Text-RGBT Person Retrieval, which incorporates cross-spectrum information fusion by combining the complementary cues from visible and thermal modalities for robust person retrieval in challenging environments. The key challenge of Text-RGBT person retrieval lies in aligning text with multi-modal visual features. However, the inherent heterogeneity between visible and thermal modalities may interfere with the alignment between vision and language. To handle this problem, we propose a Decoupled Cross-modal Alignment network (DCAlign), which thoroughly mines the relationships between modality-specific and modality-collaborative visual features and the text, for Text-RGBT person retrieval. To promote the research and development of this field, we create a high-quality Text-RGBT person retrieval dataset, RGBT-PEDES. RGBT-PEDES contains 1,822 identities from different age groups and genders with 4,723 pairs of calibrated RGB and T images, and covers highly diverse scenes from both daytime and nighttime with a variety of challenges such as occlusion, weak alignment and adverse lighting conditions. Additionally, we carefully annotate 7,987 fine-grained textual descriptions for all RGBT person image pairs. Extensive experiments on RGBT-PEDES demonstrate that our method outperforms existing text-image person retrieval methods.
Muhammad Ahmed Mohsin, Ahsan Bilal, Sagnik Bhattacharya, John M. Cioffi
Future wireless networks aim to deliver high data rates and lower power
consumption while ensuring seamless connectivity, necessitating robust
optimization. Large language models (LLMs) have been deployed for generalized
optimization scenarios. To take advantage of generative AI (GAI) models, we
propose retrieval augmented generation (RAG) for multi-sensor wireless
environment perception. Utilizing domain-specific prompt engineering, we apply
RAG to efficiently harness multimodal data inputs from sensors in a wireless
environment. Key pre-processing pipelines including image-to-text conversion,
object detection, and distance calculations for multimodal RAG input from
multi-sensor data are proposed to obtain a unified vector database crucial for
optimizing LLMs in global wireless tasks. Our evaluation, conducted with
OpenAI's GPT and Google's Gemini models, demonstrates an 8%, 8%, 10%, 7%, and
12% improvement in relevancy, faithfulness, completeness, similarity, and
accuracy, respectively, compared to conventional LLM-based designs.
Furthermore, our RAG-based LLM framework with vectorized databases is
computationally efficient, providing real-time convergence under latency
constraints.
Authors' comments: Accepted @ ICC 2025
Sirinda Palahan
Overseas investment and trade can be daunting for beginners due to the vast amount of complex information. This paper presents a chatbot system that integrates natural language processing and information retrieval techniques to simplify the document retrieval process. The proposed system identifies the most relevant content, enabling users to navigate the intricate landscape of foreign trade and investment more efficiently. Our methodology combines the BM25 model and a deep learning model to rank and retrieve documents, aiming to reduce noise in the document content and enhance the accuracy of the results. Experiments with Thai natural language queries have demonstrated the effectiveness of our system in retrieving pertinent documents. A user satisfaction survey further validated the system's effectiveness. Most respondents found the system helpful and agreed with the suggested documents, indicating its potential as a valuable tool for Thai entrepreneurs navigating foreign trade and investment.
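As a rough illustration of combining BM25 with a learned ranker, the sketch below scores tokenized documents with Okapi BM25 and then merges those scores with a second score list by a weighted sum. The abstract does not say how the two models are combined, so the weighted sum, the `alpha` parameter, the function names, the English toy documents, and the placeholder neural scores are all assumptions made for this example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def hybrid_rank(bm25, neural, alpha=0.5):
    """Rank document indices by a weighted sum of min-max-normalized scores."""
    def normalize(xs):
        lo, hi = min(xs), max(xs)
        return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]
    combined = [alpha * a + (1 - alpha) * n
                for a, n in zip(normalize(bm25), normalize(neural))]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)

docs = [
    "export license rules for rice".split(),
    "tourism statistics for the north".split(),
    "rice export tariff and license fees".split(),
]
query = "rice export license".split()
lexical = bm25_scores(query, docs)
neural = [0.2, 0.1, 0.9]  # placeholder scores standing in for a learned model
ranking = hybrid_rank(lexical, neural)
```

Whitespace tokenization is used here only for brevity; Thai text would need a proper word segmenter before scoring.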
Xukun Zhou, Fengxin Li, Ming Chen, Yan Zhou, Pengfei Wan, Di Zhang, Hongyan Liu, Jun He et al.
Audio-driven human gesture synthesis is a crucial task with broad applications in virtual avatars, human-computer interaction, and creative content generation. Despite notable progress, existing methods often produce gestures that are coarse, lack expressiveness, and fail to fully align with audio semantics. To address these challenges, we propose ExGes, a novel retrieval-enhanced diffusion framework with three key designs: (1) a Motion Base Construction, which builds a gesture library using the training dataset; (2) a Motion Retrieval Module, employing contrastive learning and momentum distillation for fine-grained retrieval of reference poses; and (3) a Precision Control Module, integrating partial masking and stochastic masking to enable flexible and fine-grained control. Experimental evaluations on BEAT2 demonstrate that ExGes reduces Fréchet Gesture Distance by 6.2% and improves motion diversity by 5.3% over EMAGE, with user studies revealing a 71.3% preference for its naturalness and semantic relevance. Code will be released upon acceptance.
Hairu Wang, Yuan Feng, Xike Xie, S Kevin Zhou
Although Large Language Models achieve strong success in many tasks, they still suffer from hallucinations and knowledge deficiencies in real-world applications. Many knowledge graph-based retrieval-augmented generation (KG-RAG) methods enhance the quality and credibility of LLMs by leveraging structure and semantic information in KGs as external knowledge bases. However, these methods struggle to effectively incorporate structure information, either incurring high computational costs or underutilizing available knowledge. Inspired by smoothing operations in graph representation learning, we propose path pooling, a simple, train-free strategy that introduces structure information through a novel path-centric pooling operation. It seamlessly integrates into existing KG-RAG methods in a plug-and-play manner, enabling richer structure information utilization. Extensive experiments demonstrate that incorporating the path pooling into the state-of-the-art KG-RAG method consistently improves performance across various settings while introducing negligible additional cost. Code is coming soon at https://github.com/hrwang00/path-pooling.
Shiping Yang, Jie Wu, Wenbiao Ding, Ning Wu, Shining Liang, Ming Gong, Hengyuan Zhang, Dongmei Zhang
Robustness has become a critical attribute for the deployment of RAG systems in real-world applications. Existing research focuses on robustness to explicit noise (e.g., document semantics) but overlooks spurious features (a.k.a. implicit noise). While previous works have explored spurious features in LLMs, they are limited to specific features (e.g., formats) and narrow scenarios (e.g., ICL). In this work, we statistically confirm the presence of spurious features in the RAG paradigm, a robustness problem caused by the sensitivity of LLMs to semantic-agnostic features. Moreover, we provide a comprehensive taxonomy of spurious features and empirically quantify their impact through controlled experiments. Further analysis reveals that not all spurious features are harmful and that some can even be beneficial. Extensive evaluation results across multiple LLMs suggest that spurious features are a widespread and challenging problem in the field of RAG. To facilitate future research, we release all code and data at: https://github.com/maybenotime/RAG-SpuriousFeatures.
Chan Hur, Jeong-hun Hong, Dong-hun Lee, Dabin Kang, Semin Myeong, Sang-hyo Park, Hyeyoung Park
In recent text-video retrieval, the use of additional captions from
vision-language models has shown promising effects on the performance. However,
existing models using additional captions often have struggled to capture the
rich semantics, including temporal changes, inherent in the video. In addition,
incorrect information caused by generative models can lead to inaccurate
retrieval. To address these issues, we propose a new framework, Narrating the
Video (NarVid), which strategically leverages the comprehensive information
available from frame-level captions, the narration. The proposed NarVid
exploits narration in multiple ways: 1) feature enhancement through cross-modal
interactions between narration and video, 2) query-aware adaptive filtering to
suppress irrelevant or incorrect information, 3) dual-modal matching score by
adding query-video similarity and query-narration similarity, and 4)
hard-negative loss to learn discriminative features from multiple perspectives
using the two similarities from different views. Experimental results
demonstrate that NarVid achieves state-of-the-art performance on various
benchmark datasets.
Authors' comments: Accepted at CVPR 2025
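The dual-modal matching score in item 3) of the NarVid abstract above can be pictured with a small sketch that adds a query-video similarity to a query-narration similarity. Cosine similarity, the two-dimensional toy embeddings, and the names `cosine` and `dual_modal_score` are assumptions for illustration; the abstract does not specify the similarity function or any weighting.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dual_modal_score(query_emb, video_emb, narration_emb):
    """Sum of query-video and query-narration cosine similarities."""
    return cosine(query_emb, video_emb) + cosine(query_emb, narration_emb)

query = [1.0, 0.0]
candidates = {
    "clip_a": ([1.0, 0.0], [0.0, 1.0]),  # video matches, narration does not
    "clip_b": ([0.6, 0.8], [1.0, 0.0]),  # video partly matches, narration matches
}
scores = {name: dual_modal_score(query, v, n)
          for name, (v, n) in candidates.items()}
best = max(scores, key=scores.get)
```

In this toy case the narration similarity lifts `clip_b` above `clip_a`, even though `clip_a` has the better video match on its own.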
João Alberto de Oliveira Lima
When users formulate queries, they often include not only the information
they seek, but also pragmatic markers such as interrogative phrasing or polite
requests. Although these speech act indicators communicate the
user's intent -- whether it is asking a question, making a
request, or stating a fact -- they do not necessarily add to the core
informational content of the query itself. This paper investigates whether
extracting the underlying propositional content from user utterances --
essentially stripping away the linguistic markers of intent -- can improve
retrieval quality in Retrieval-Augmented Generation (RAG) systems. Drawing upon
foundational insights from speech act theory, we propose a practical method for
automatically transforming queries into their propositional equivalents before
embedding. To assess the efficacy of this approach, we conducted an
experimental study involving 63 user queries related to a Brazilian
telecommunications news corpus with precomputed semantic embeddings. Results
demonstrate clear improvements in semantic similarity between query embeddings
and document embeddings at top ranks, confirming that queries stripped of
speech act indicators more effectively retrieve relevant content.
Authors' comments: 19 pages, 4 figures
Jungbae Park, Heonseok Jang
E-commerce search optimization has evolved to include a wider range of metrics that reflect user engagement and business objectives. Modern search frameworks now incorporate advanced quality features, such as sales counts and document-query relevance, to better align search results with these goals. Traditional methods typically focus on click-through rate (CTR) as a measure of engagement or relevance, but this can miss true purchase intent, creating a gap between user interest and actual conversions. Joint training with the click-through conversion rate (CTCVR) has become essential for understanding buying behavior, although its sparsity poses challenges for reliable optimization. This study presents MOHPER, a Multi-Objective Hyperparameter Optimization framework for E-commerce Retrieval systems. Utilizing Bayesian optimization and sampling, it jointly optimizes CTR, CTCVR, and related objectives, focusing on user engagement and conversion. In addition, to improve the selection of the best configuration from multi-objective optimization, we suggest advanced methods for hyperparameter selection, including a meta-configuration voting strategy and a cumulative training approach that leverages prior optimal configurations to improve training speed and efficiency. Currently deployed in a live setting, our proposed framework substantiates its practical efficacy in achieving a balanced optimization that aligns with both user satisfaction and revenue goals.
Jasper Kyle Catapang
Yes, repurposing multiple-choice question-answering (MCQA) models for
document reranking is both feasible and valuable. This preliminary work is
founded on mathematical parallels between MCQA decision-making and
cross-encoder semantic relevance assessments, leading to the development of R*,
a proof-of-concept model that harmonizes these approaches. Designed to assess
document relevance with depth and precision, R* showcases how MCQA's principles
can improve reranking in information retrieval (IR) and retrieval-augmented
generation (RAG) systems -- ultimately enhancing search and dialogue in
AI-powered systems. In experimental validation, R* improves retrieval accuracy
while remaining lightweight, contributing a practical prototype of MCQA-based
reranking to the field.
Authors' comments: Accepted to The 38th Pacific Asia Conference on Language, Information
and Computation; PACLIC 38 (2024)
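One way to picture the parallel between MCQA and reranking described in the R* abstract above is to treat each candidate document as an answer choice, compute one logit per (query, choice) pair, and apply a softmax over the choices. The sketch below does that with a token-overlap scorer standing in for a model; `toy_logit` and `rerank_as_mcqa` are hypothetical names, and nothing here is taken from the R* implementation.

```python
import math

def toy_logit(query, document):
    """Hypothetical stand-in for a model's (query, choice) logit: shared-token count."""
    return float(len(set(query.lower().split()) & set(document.lower().split())))

def rerank_as_mcqa(query, documents, scorer=toy_logit):
    """Treat documents as answer choices; softmax their logits and sort by probability."""
    logits = [scorer(query, d) for d in documents]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(documents)), key=lambda i: probs[i], reverse=True)
    return [(documents[i], probs[i]) for i in order]

docs = [
    "cats are small domesticated mammals",
    "dense retrieval encodes queries and documents as vectors",
    "reranking reorders retrieved documents by relevance",
]
ranked = rerank_as_mcqa("how does reranking order retrieved documents", docs)
```

Replacing `toy_logit` with a cross-encoder or MCQA model score would leave the softmax-and-sort step unchanged.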
Tingyu Song, Guo Gan, Mingsheng Shang, Yilun Zhao
We introduce IFIR, the first comprehensive benchmark designed to evaluate
instruction-following information retrieval (IR) in expert domains. IFIR
includes 2,426 high-quality examples and covers eight subsets across four
specialized domains: finance, law, healthcare, and science literature. Each
subset addresses one or more domain-specific retrieval tasks, replicating
real-world scenarios where customized instructions are critical. IFIR enables a
detailed analysis of instruction-following retrieval capabilities by
incorporating instructions at different levels of complexity. We also propose a
novel LLM-based evaluation method to provide a more precise and reliable
assessment of model performance in following instructions. Through extensive
experiments on 15 frontier retrieval models, including those based on LLMs, our
results reveal that current models face significant challenges in effectively
following complex, domain-specific instructions. We further provide in-depth
analyses to highlight these limitations, offering valuable insights to guide
future advancements in retriever development.
Authors' comments: NAACL 2025 Main
Tengfei Zhang, Ziheng Zhao, Chaoyi Wu, Xiao Zhou, Ya Zhang, Yangfeng Wang, Weidi Xie
Developing advanced medical imaging retrieval systems is challenging due to the varying definitions of 'similar images' across different medical contexts. This challenge is compounded by the lack of large-scale, high-quality medical imaging retrieval datasets and benchmarks. In this paper, we propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities in a scalable and fully automatic manner. Using this approach, we construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans, providing detailed image-image ranking annotations conditioned on diverse anatomical structures. Furthermore, we develop two retrieval systems, RadIR-CXR and RadIR-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks. These systems also enable flexible, effective image retrieval conditioned on specific anatomical structures described in text, achieving state-of-the-art results on 77 out of 78 metrics.
Sangyeop Kim, Sohhyung Park, Jaewon Jung, Jinseok Kim, Sungzoon Cho
Understanding user satisfaction with conversational systems, known as User
Satisfaction Estimation (USE), is essential for assessing dialogue quality and
enhancing user experiences. However, existing methods for USE face challenges
due to limited understanding of underlying reasons for user dissatisfaction and
the high costs of annotating user intentions. To address these challenges, we
propose PRAISE (Plan and Retrieval Alignment for Interpretable Satisfaction
Estimation), an interpretable framework for effective user satisfaction
prediction. PRAISE operates through three key modules. The Strategy Planner
develops strategies, which are natural language criteria for classifying user
satisfaction. The Feature Retriever then incorporates knowledge on user
satisfaction from Large Language Models (LLMs) and retrieves relevance features
from utterances. Finally, the Score Analyzer evaluates strategy predictions and
classifies user satisfaction. Experimental results demonstrate that PRAISE
achieves state-of-the-art performance on three benchmarks for the USE task.
Beyond its superior performance, PRAISE offers additional benefits. It enhances
interpretability by providing instance-level explanations through effective
alignment of utterances with strategies. Moreover, PRAISE operates more
efficiently than existing approaches by eliminating the need for LLMs during
the inference phase.
Authors' comments: Accepted by NAACL 2025
Sangyeop Kim, Hangyeul Lee, Yohan Lee
The growth of conversational AI services has increased demand for effective
information retrieval from dialogue data. However, existing methods often face
challenges in capturing semantic intent or require extensive labeling and
fine-tuning. This paper introduces HEISIR (Hierarchical Expansion of Inverted
Semantic Indexing for Retrieval), a novel framework that enhances semantic
understanding in conversational data retrieval through optimized data
ingestion, eliminating the need for resource-intensive labeling or model
adaptation. HEISIR implements a two-step process: (1) Hierarchical Triplets
Formulation and (2) Adjunct Augmentation, creating semantic indices consisting
of Subject-Verb-Object-Adjunct (SVOA) quadruplets. This structured
representation effectively captures the underlying semantic information from
dialogue content. HEISIR achieves high retrieval performance while maintaining
low latency during the actual retrieval process. Our experimental results
demonstrate that HEISIR outperforms fine-tuned models across various embedding
types and language models. Beyond improving retrieval capabilities, HEISIR also
offers opportunities for intent and topic analysis in conversational data,
providing a versatile solution for dialogue systems.
Authors' comments: Accepted by NAACL 2025 (Findings)
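To make the Subject-Verb-Object-Adjunct (SVOA) structure in the HEISIR abstract above concrete, the following sketch stores hand-written quadruplets in a small inverted index and looks messages up by field values. The `SVOA` tuple, `build_index`, `lookup`, and the sample messages are all hypothetical, and exact string matching stands in for whatever semantic matching the actual framework uses.

```python
from collections import defaultdict
from typing import NamedTuple

class SVOA(NamedTuple):
    subject: str
    verb: str
    obj: str
    adjunct: str

def build_index(quadruplets_by_message):
    """Map each lowercased SVOA field value to the ids of messages containing it."""
    index = defaultdict(set)
    for message_id, quads in quadruplets_by_message.items():
        for quad in quads:
            for value in quad:
                if value:  # skip empty fields such as a missing adjunct
                    index[value.lower()].add(message_id)
    return index

def lookup(index, terms):
    """Return message ids ranked by how many query terms they match."""
    hits = defaultdict(int)
    for term in terms:
        for message_id in index.get(term.lower(), ()):
            hits[message_id] += 1
    return sorted(hits, key=lambda m: (-hits[m], m))

# Hand-written quadruplets; in practice an extraction step would produce them.
messages = {
    "m1": [SVOA("user", "cancel", "subscription", "next month")],
    "m2": [SVOA("user", "upgrade", "subscription", "today")],
    "m3": [SVOA("agent", "refund", "payment", "")],
}
index = build_index(messages)
result = lookup(index, ["cancel", "subscription"])
```

Because the index is built at ingestion time, a lookup only touches the entries for the query terms, which is one plausible reading of how low retrieval latency could be maintained.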
Yating Liu, Zimo Liu, Xiangyuan Lan, Wenming Yang, Yaowei Li, Qingmin Liao
Text-based person retrieval (TPR) has gained significant attention as a
fine-grained and challenging task that closely aligns with practical
applications. Tailoring CLIP to the person domain is now an emerging research
topic due to the abundant knowledge of vision-language pretraining, but
challenges remain during fine-tuning: (i) Previous full-model fine-tuning in TPR
is computationally expensive and prone to overfitting. (ii) Existing
parameter-efficient transfer learning (PETL) for TPR lacks fine-grained
feature extraction. To address these issues, we propose Domain-Aware
Mixture-of-Adapters (DM-Adapter), which unifies Mixture-of-Experts (MOE) and
PETL to enhance fine-grained feature representations while maintaining
efficiency. Specifically, Sparse Mixture-of-Adapters is designed in parallel to
MLP layers in both vision and language branches, where different experts
specialize in distinct aspects of person knowledge to handle features more
finely. To promote the router to exploit domain information effectively and
alleviate the routing imbalance, Domain-Aware Router is then developed by
building a novel gating function and injecting learnable domain-aware prompts.
Extensive experiments show that our DM-Adapter achieves state-of-the-art
performance, outperforming previous methods by a significant margin.
Authors' comments: 9 pages, 5 figures, accepted by AAAI 2025
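The sparse mixture idea in the DM-Adapter abstract above can be sketched generically: a router produces one logit per expert, only the top-k experts run, and their outputs are mixed with renormalized gate weights. This is a plain top-k mixture-of-experts sketch, not the paper's Domain-Aware Router; the gating function, the domain-aware prompts, and the placement in parallel to MLP layers are omitted, and the toy experts and gate logits are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_mixture(x, gate_logits, experts, top_k=2):
    """Route x to the top_k experts and mix their outputs by renormalized gate weights."""
    top = sorted(range(len(experts)),
                 key=lambda i: gate_logits[i], reverse=True)[:top_k]
    weights = softmax([gate_logits[i] for i in top])
    output = [0.0] * len(x)
    for weight, idx in zip(weights, top):
        expert_out = experts[idx](x)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, top

# Toy experts: each scales the input by a different factor.
experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0, 4.0)]
output, chosen = sparse_mixture([1.0, -1.0], [0.0, 2.0, 2.0, -1.0], experts)
```

Only the two selected experts are evaluated, which is where the efficiency of sparse routing comes from; the remaining experts contribute nothing to this input.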
Bryan Li, Jiaming Luo, Eleftheria Briakou, Colin Cherry
While large language models (LLMs) have been increasingly adopted for machine translation (MT), their performance for specialist domains such as medicine and law remains an open challenge. Prior work has shown that LLMs can be domain-adapted at test time by retrieving targeted few-shot demonstrations or terminologies for inclusion in the prompt. Meanwhile, for general-purpose LLM MT, recent studies have found some success in generating similarly useful domain knowledge from an LLM itself, prior to translation. Our work studies domain-adapted MT with LLMs through a careful prompting setup, finding that demonstrations consistently outperform terminology, and retrieval consistently outperforms generation. We find that generating demonstrations with weaker models can close the gap with larger models' zero-shot performance. Given the effectiveness of demonstrations, we perform detailed analyses to understand their value. We find that domain-specificity is particularly important, and that the popular multi-domain benchmark is testing adaptation to a particular writing style more so than to a specific domain.
Mohsen Fayyaz, Ali Modarressi, Hinrich Schuetze, Nanyun Peng
Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (e.g., Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns like over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query's answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document in less than 3% of cases over a biased document without the answer. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to not providing any documents at all.
Kanghui Ning, Zijie Pan, Yu Liu, Yushan Jiang, James Y. Zhang, Kashif Rasul, Anderson Schneider, Lintao Ma et al.
Large Language Models (LLMs) and Foundation Models (FMs) have recently become prevalent for time series forecasting tasks. While fine-tuning LLMs enables domain adaptation, they often struggle to generalize across diverse and unseen datasets. Moreover, existing Time Series Foundation Models (TSFMs) still face challenges in handling non-stationary dynamics and distribution shifts, largely due to the lack of effective mechanisms for adaptation. To this end, we present TS-RAG, a retrieval-augmented generation framework for time series forecasting that enhances the generalization and interpretability of TSFMs. Specifically, TS-RAG leverages pre-trained time series encoders to retrieve semantically relevant segments from a dedicated knowledge base, enriching the contextual representation of the input query. Furthermore, we propose an Adaptive Retrieval Mixer (ARM) module that dynamically fuses the retrieved patterns with the TSFM's internal representation, improving forecasting accuracy without requiring task-specific fine-tuning. Thorough empirical studies on seven public benchmark datasets demonstrate that TS-RAG achieves state-of-the-art zero-shot forecasting performance, outperforming the existing TSFMs by up to 6.84% across diverse domains while also providing desirable interpretability.