Runtao Ren, Jian Ma, Jianxi Luo
Retrieval-Augmented Generation (RAG) systems in the Intellectual Property (IP) field often struggle with diverse user queries, including colloquial expressions, spelling errors, and ambiguous terminology, leading to inaccurate retrieval and suboptimal responses. To address this challenge, we propose Multi-Angle Question Generation and Retrieval Fine-Tuning Method (MQG-RFM), a novel framework that leverages large language models (LLMs) to simulate varied user inquiries and fine-tunes retrieval models to align semantically equivalent but linguistically diverse questions. Unlike complex architectural modifications, MQG-RFM adopts a lightweight Data-to-Tune paradigm, combining prompt-engineered query generation with hard negative mining to enhance retrieval robustness without costly infrastructure changes. Experimental results on a Taiwan patent Q&A dataset show 185.62% improvement in retrieval accuracy on the Patent Consultation dataset and 262.26% improvement on the Novel Patent Technology Report dataset, with 14.22% and 53.58% improvements in generation quality over the baselines, respectively. By bridging the gap between user intent and system comprehension through semantic-aware retrieval optimization, MQG-RFM offers a practical, scalable approach for rapid, cost-effective deployment among small and medium-sized agencies seeking reliable patent intelligence solutions. Additionally, our proposed method has already been adopted by ScholarMate, the largest professional research social networking platform in China, to support real-world development and deployment. A demo version of the instantiated is available at https://github.com/renruntao/patent_rag.
Linhao Ye, Lang Yu, Zhikai Lei, Qin Chen, Jie Zhou, Liang He
Retrieval-augmented generation (RAG) is usually integrated into large language models (LLMs) to mitigate hallucinations and knowledge obsolescence. Whereas,conventional one-step retrieve-and-read methods are insufficient for multi-hop question answering, facing challenges of retrieval semantic mismatching and the high cost in handling interdependent subquestions. In this paper, we propose Optimizing Question Semantic Space for Dynamic Retrieval-Augmented Multi-hop Question Answering (Q-DREAM). Q-DREAM consists of three key modules: (1) the Question Decomposition Module (QDM), which decomposes multi-hop questions into fine-grained subquestions; (2) the Subquestion Dependency Optimizer Module (SDOM), which models the interdependent relations of subquestions for better understanding; and (3) the Dynamic Passage Retrieval Module (DPRM), which aligns subquestions with relevant passages by optimizing the semantic embeddings. Experimental results across various benchmarks demonstrate that Q-DREAM significantly outperforms existing RAG methods, achieving state-of-the-art performance in both in-domain and out-of-domain settings. Notably, Q-DREAM also improves retrieval efficiency while maintaining high accuracy compared with recent baselines.
Yubai Wei, Jiale Han, Yi Yang
Text embedding models play a cornerstone role in AI applications, such as
retrieval-augmented generation (RAG). While general-purpose text embedding
models demonstrate strong performance on generic retrieval benchmarks, their
effectiveness diminishes when applied to private datasets (e.g.,
company-specific proprietary data), which often contain specialized terminology
and lingo. In this work, we introduce BMEmbed, a novel method for adapting
general-purpose text embedding models to private datasets. By leveraging the
well-established keyword-based retrieval technique (BM25), we construct
supervisory signals from the ranking of keyword-based retrieval results to
facilitate model adaptation. We evaluate BMEmbed across a range of domains,
datasets, and models, showing consistent improvements in retrieval performance.
Moreover, we provide empirical insights into how BM25-based signals contribute
to improving embeddings by fostering alignment and uniformity, highlighting the
value of this approach in adapting models to domain-specific data. We release
the source code available at https://github.com/BaileyWei/BMEmbed for the
research community.
Authors' comments: Link: https://github.com/BaileyWei/BMEmbed
Boyi Zhang, Zhuo Liu, Hangfeng He
In real practice, questions are typically complex and knowledge-intensive, requiring Large Language Models (LLMs) to recognize the multifaceted nature of the question and reason across multiple information sources. Iterative and adaptive retrieval, where LLMs decide when and what to retrieve based on their reasoning, has been shown to be a promising approach to resolve complex, knowledge-intensive questions. However, the performance of such retrieval frameworks is limited by the accumulation of reasoning errors and misaligned retrieval results. To overcome these limitations, we propose TreeRare (Syntax Tree-Guided Retrieval and Reasoning), a framework that utilizes syntax trees to guide information retrieval and reasoning for question answering. Following the principle of compositionality, TreeRare traverses the syntax tree in a bottom-up fashion, and in each node, it generates subcomponent-based queries and retrieves relevant passages to resolve localized uncertainty. A subcomponent question answering module then synthesizes these passages into concise, context-aware evidence. Finally, TreeRare aggregates the evidence across the tree to form a final answer. Experiments across five question answering datasets involving ambiguous or multi-hop reasoning demonstrate that TreeRare achieves substantial improvements over existing state-of-the-art methods.
Chris M. Ward, Josh Harguess
Retrieval-Augmented Generation (RAG) systems, which integrate Large Language
Models (LLMs) with external knowledge sources, are vulnerable to a range of
adversarial attack vectors. This paper examines the importance of RAG systems
through recent industry adoption trends and identifies the prominent attack
vectors for RAG: prompt injection, data poisoning, and adversarial query
manipulation. We analyze these threats under risk management lens, and propose
robust prioritized control list that includes risk-mitigating actions like
input validation, adversarial training, and real-time monitoring.
Authors' comments: SPIE DCS: Proceedings Volume Assurance and Security for AI-enabled
Systems 2025
Jiayu Yao, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Yuyao Ge, Zhecheng Li, Xueqi Cheng
Multimodal Retrieval-Augmented Generation (RAG) systems have become essential in knowledge-intensive and open-domain tasks. As retrieval complexity increases, ensuring the robustness of these systems is critical. However, current RAG models are highly sensitive to the order in which evidence is presented, often resulting in unstable performance and biased reasoning, particularly as the number of retrieved items or modality diversity grows. This raises a central question: How does the position of retrieved evidence affect multimodal RAG performance? To answer this, we present the first comprehensive study of position bias in multimodal RAG systems. Through controlled experiments across text-only, image-only, and mixed-modality tasks, we observe a consistent U-shaped accuracy curve with respect to evidence position. To quantify this bias, we introduce the Position Sensitivity Index ($PSI_p$) and develop a visualization framework to trace attention allocation patterns across decoder layers. Our results reveal that multimodal interactions intensify position bias compared to unimodal settings, and that this bias increases logarithmically with retrieval range. These findings offer both theoretical and empirical foundations for position-aware analysis in RAG, highlighting the need for evidence reordering or debiasing strategies to build more reliable and equitable generation systems.
Juraj Vladika, Annika Domres, Mai Nguyen, Rebecca Moser, Jana Nano, Felix Busch, Lisa C. Adams, Keno K. Bressem et al.
Large language models (LLMs) exhibit extensive medical knowledge but are
prone to hallucinations and inaccurate citations, which pose a challenge to
their clinical adoption and regulatory compliance. Current methods, such as
Retrieval Augmented Generation, partially address these issues by grounding
answers in source documents, but hallucinations and low fact-level
explainability persist. In this work, we introduce a novel atomic fact-checking
framework designed to enhance the reliability and explainability of LLMs used
in medical long-form question answering. This method decomposes LLM-generated
responses into discrete, verifiable units called atomic facts, each of which is
independently verified against an authoritative knowledge base of medical
guidelines. This approach enables targeted correction of errors and direct
tracing to source literature, thereby improving the factual accuracy and
explainability of medical Q&A. Extensive evaluation using multi-reader
assessments by medical experts and an automated open Q&A benchmark demonstrated
significant improvements in factual accuracy and explainability. Our framework
achieved up to a 40% overall answer improvement and a 50% hallucination
detection rate. The ability to trace each atomic fact back to the most relevant
chunks from the database provides a granular, transparent explanation of the
generated responses, addressing a major gap in current medical AI applications.
This work represents a crucial step towards more trustworthy and reliable
clinical applications of LLMs, addressing key prerequisites for clinical
application and fostering greater confidence in AI-assisted healthcare.
Authors' comments: 11 pages, 4 figures
Zahid Hassan Tushar, Adeleke Ademakinwa, Jianwu Wang, Zhibo Zhang, Sanjay Purushotham
Cloud Optical Thickness (COT) is a critical cloud property influencing
Earth's climate, weather, and radiation budget. Satellite radiance measurements
enable global COT retrieval, but challenges like 3D cloud effects, viewing
angles, and atmospheric interference must be addressed to ensure accurate
estimation. Traditionally, the Independent Pixel Approximation (IPA) method,
which treats individual pixels independently, has been used for COT estimation.
However, IPA introduces significant bias due to its simplified assumptions.
Recently, deep learning-based models have shown improved performance over IPA
but lack robustness, as they are sensitive to variations in radiance intensity,
distortions, and cloud shadows. These models also introduce substantial errors
in COT estimation under different solar and viewing zenith angles. To address
these challenges, we propose a novel angle-invariant, attention-based deep
model called Cloud-Attention-Net with Angle Coding (CAAC). Our model leverages
attention mechanisms and angle embeddings to account for satellite viewing
geometry and 3D radiative transfer effects, enabling more accurate retrieval of
COT. Additionally, our multi-angle training strategy ensures angle invariance.
Through comprehensive experiments, we demonstrate that CAAC significantly
outperforms existing state-of-the-art deep learning models, reducing cloud
property retrieval errors by at least a factor of nine.
Authors' comments: 6 pages, 7 figures, to be published in 2025 IEEE International
Conference on Image Processing (ICIP)
Ming-Hsun Yang
This work examines the multi-view compressive phase retrieval problem in a distributed sensor network, where each sensor device, limited by storage and sensing capabilities, can access only intensity measurements from an unknown part of the global sparse vector. The goal is to enable each sensor to recover its observable sparse signal when measurements are corrupted by outliers. To achieve reliable local signal recovery with limited data access, we propose a distributed reconstruction algorithm that enables collaboration among sensor devices without the need to share individual raw data. The proposed scheme employs a two-stage approach that first recovers the amplitude of the global signal (at a central server) and subsequently estimates the observable nonzero signal entries (at each local device). Our analytic results show that perfect global signal amplitude recovery can be achieved under mild conditions on the support size of sparse outliers and the view blockage level. In addition, the exact reconstruction of locally observed signal components is shown to be attainable in the noise-free case by solving a binary optimization problem, subject to a mild requirement on the structure of the sensing matrix. Computer simulations are provided to illustrate the effectiveness of the proposed scheme.
Yingjia Xu, Jinlin Wu, Zhen Chen, Daming Gao, Yang Yang, Zhen Lei, Min Cao
Text-based person retrieval aims to identify a target individual from a
gallery of images based on a natural language description. It presents a
significant challenge due to the complexity of real-world scenes and the
ambiguity of appearance-related descriptions. Existing methods primarily
emphasize appearance-based cross-modal retrieval, often neglecting the
contextual information embedded within the scene, which can offer valuable
complementary insights for retrieval. To address this, we introduce
SCENEPERSON-13W, a large-scale dataset featuring over 100,000 scenes with rich
annotations covering both pedestrian appearance and environmental cues. Based
on this, we propose SA-Person, a two-stage retrieval framework. In the first
stage, it performs discriminative appearance grounding by aligning textual cues
with pedestrian-specific regions. In the second stage, it introduces
SceneRanker, a training-free, scene-aware re-ranking method leveraging
multimodal large language models to jointly reason over pedestrian appearance
and the global scene context. Experiments on SCENEPERSON-13W validate the
effectiveness of our framework in challenging scene-level retrieval scenarios.
The code and dataset will be made publicly available.
Authors' comments: 22 pages, 7 figures. Under review
Hao Chen, Yukun Yan, Sen Mei, Wanxiang Che, Zhenghao Liu, Qi Shi, Xinze Li, Yuchun Fan et al.
Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge to improve factuality. However, existing RAG systems frequently underutilize the retrieved documents, failing to extract and integrate the key clues needed to support faithful and interpretable reasoning, especially in cases where relevant evidence is implicit, scattered, or obscured by noise. To address this issue, we propose ClueAnchor, a novel framework for enhancing RAG via clue-anchored reasoning exploration and optimization. ClueAnchor extracts key clues from retrieved content and generates multiple reasoning paths based on different knowledge configurations, optimizing the model by selecting the most effective one through reward-based preference optimization. Experiments show that ClueAnchor significantly outperforms prior RAG baselines in reasoning completeness and robustness. Further analysis confirms its strong resilience to noisy or partially relevant retrieved content, as well as its capability to identify supporting evidence even in the absence of explicit clue supervision during inference.
Jun Fan, Ailing Yan, Xianchao Xiu, Wanquan Liu
Phase retrieval (PR) is a popular research topic in signal processing and machine learning. However, its performance degrades significantly when the measurements are corrupted by noise or outliers. To address this limitation, we propose a novel robust sparse PR method that covers both real- and complex-valued cases. The core is to leverage the Huber function to measure the loss and adopt the $\ell_{1/2}$-norm regularization to realize feature selection, thereby improving the robustness of PR. In theory, we establish statistical guarantees for such robustness and derive necessary optimality conditions for global minimizers. Particularly, for the complex-valued case, we provide a fixed point inclusion property inspired by Wirtinger derivatives. Furthermore, we develop an efficient optimization algorithm by integrating the gradient descent method into a majorization-minimization (MM) framework. It is rigorously proved that the whole generated sequence is convergent and also has a linear convergence rate under mild conditions, which has not been investigated before. Numerical examples under different types of noise validate the robustness and effectiveness of our proposed method.
Dohyeon Lee, Yeonseok Jeong, Seung-won Hwang
Chain-of-Thought (CoT) prompting enables complex reasoning in large language models (LLMs), including applications in information retrieval (IR). However, it often leads to overthinking, where models produce excessively long and semantically redundant traces with little or no benefit. We identify two key challenges in IR: redundant trajectories that revisit similar states and misguided reasoning that diverges from user intent. To address these, we propose State Machine Reasoning (SMR), a transition-based reasoning framework composed of discrete actions (Refine, Rerank, Stop) that support early stopping and fine-grained control. Experiments on the BEIR and BRIGHT benchmarks show that SMR improves retrieval performance (nDCG@10) by 3.4% while reducing token usage by 74.4%. It generalizes across LLMs and retrievers without requiring task-specific tuning, offering a practical alternative to conventional CoT reasoning. The code and details are available at https://github.com/ldilab/SMR.
Koki Matsuishi, Kosuke Ukita, Tsuyoshi Okita
In recent years, the widespread adoption of wearable devices has highlighted
the growing importance of behavior analysis using IMU. While applications span
diverse fields such as healthcare and robotics, recent studies have
increasingly focused on multimodal analysis, in addition to unimodal analysis.
Several studies have proposed multimodal foundation models that incorporate
first-person video and text data; however, these models still fall short in
providing a detailed analysis of full-body human activity. To address this
limitation, we propose Activity Understanding and Representations Alignment -
Multimodal Foundation Model (AURA-MFM), a foundational model integrating four
modalities: third-person video, motion capture, IMU, and text. By incorporating
third-person video and motion capture data, the model enables a detailed and
multidimensional understanding of human activity, which first-person
perspectives alone fail to capture. Additionally, a Transformer-based IMU
encoder is employed to enhance the model's overall performance. Experimental
evaluations on retrieval and activity recognition tasks demonstrate that our
model surpasses existing methods. Notably, in the zero-shot classification for
action recognition, our method achieved significantly higher performance, with
an F1-score of 0.6226 and an accuracy of 0.7320, whereas the existing method
recorded an F1-score of 0.0747 and an accuracy of 0.1961.
Authors' comments: 25 pages, 8 figures
Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Zhengzhong Tu
Large Vision-Language Models (LVLMs) have made remarkable strides in
multimodal tasks such as visual question answering, visual grounding, and
complex reasoning. However, they remain limited by static training data,
susceptibility to hallucinations, and inability to verify claims against
up-to-date, external evidence, compromising their performance in dynamic
real-world applications. Retrieval-Augmented Generation (RAG) offers a
practical solution to mitigate these challenges by allowing the LVLMs to access
large-scale knowledge databases via retrieval mechanisms, thereby grounding
model outputs in factual, contextually relevant information. Here in this
paper, we conduct the first systematic dissection of the multimodal RAG
pipeline for LVLMs, explicitly investigating (1) the retrieval phase: on the
modality configurations and retrieval strategies, (2) the re-ranking stage: on
strategies to mitigate positional biases and improve the relevance of retrieved
evidence, and (3) the generation phase: we further investigate how to best
integrate retrieved candidates into the final generation process. Finally, we
extend to explore a unified agentic framework that integrates re-ranking and
generation through self-reflection, enabling LVLMs to select relevant evidence
and suppress irrelevant context dynamically. Our full-stack exploration of RAG
for LVLMs yields substantial insights, resulting in an average performance
boost of 5% without any fine-tuning.
Authors' comments: 16 pages, 11 figures
Mingyang Mao, Mariela M. Perez-Cabarcas, Utteja Kallakuri, Nicholas R. Waytowich, Xiaomin Lin, Tinoosh Mohsenin
To effectively engage in human society, the ability to adapt, filter information, and make informed decisions in ever-changing situations is critical. As robots and intelligent agents become more integrated into human life, there is a growing opportunity-and need-to offload the cognitive burden on humans to these systems, particularly in dynamic, information-rich scenarios. To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation for adaptive robotic assistance in dynamic, human-centered situations. To evaluate its capability in a realistic human-assistance proxy task, we benchmarked Multi-RAG on the MMBench-Video dataset, a challenging multimodal video understanding benchmark. Our system achieves superior performance compared to existing open-source video large language models (Video-LLMs) and large vision-language models (LVLMs), while utilizing fewer resources and less input data. The results demonstrate Multi- RAG's potential as a practical and efficient foundation for future human-robot adaptive assistance systems in dynamic, real-world contexts.
Chaitanya Sharma
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance large language models (LLMs) by conditioning generation on external evidence retrieved at inference time. While RAG addresses critical limitations of parametric knowledge storage-such as factual inconsistency and domain inflexibility-it introduces new challenges in retrieval quality, grounding fidelity, pipeline efficiency, and robustness against noisy or adversarial inputs. This survey provides a comprehensive synthesis of recent advances in RAG systems, offering a taxonomy that categorizes architectures into retriever-centric, generator-centric, hybrid, and robustness-oriented designs. We systematically analyze enhancements across retrieval optimization, context filtering, decoding control, and efficiency improvements, supported by comparative performance analyses on short-form and multi-hop question answering tasks. Furthermore, we review state-of-the-art evaluation frameworks and benchmarks, highlighting trends in retrieval-aware evaluation, robustness testing, and federated retrieval settings. Our analysis reveals recurring trade-offs between retrieval precision and generation flexibility, efficiency and faithfulness, and modularity and coordination. We conclude by identifying open challenges and future research directions, including adaptive retrieval architectures, real-time retrieval integration, structured reasoning over multi-hop evidence, and privacy-preserving retrieval mechanisms. This survey aims to consolidate current knowledge in RAG research and serve as a foundation for the next generation of retrieval-augmented language modeling systems.
Nikita Khramov, Andrei Kozyrev, Gleb Solovev, Anton Podkopaev
Interactive Theorem Proving was repeatedly shown to be fruitful combined with Generative Artificial Intelligence. This paper assesses multiple approaches to Rocq generation and illuminates potential avenues for improvement. We highlight the importance of thorough premise selection for generating Rocq proofs and propose a novel approach, leveraging retrieval via a self-attentive embedder model. The evaluation of the designed approach shows up to 28% relative increase of the generator's performance. We tackle the problem of writing Rocq proofs using a multi-stage agentic system, tailored for formal verification, and demonstrate its high effectiveness. We conduct an ablation study and show the use of multi-agent debate on the planning stage of proof synthesis.
Arjun Rao, Hanieh Alipour, Nick Pendar
This paper presents a comparison of embedding models in tri-modal hybrid retrieval for Retrieval-Augmented Generation (RAG) systems. We investigate the fusion of dense semantic, sparse lexical, and graph-based embeddings, focusing on the performance of the MiniLM-v6 and BGE-Large architectures. Contrary to conventional assumptions, our results show that the compact MiniLM-v6 outperforms the larger BGE-Large when integrated with LLM-based re-ranking within our tri-modal hybrid framework. Experiments conducted on the SciFact, FIQA, and NFCorpus datasets demonstrate significant improvements in retrieval quality with the MiniLM-v6 configuration. The performance difference is particularly pronounced in agentic re-ranking scenarios, indicating better alignment between MiniLM-v6's embedding space and LLM reasoning. Our findings suggest that embedding model selection for RAG systems should prioritize compatibility with multi-signal fusion and LLM alignment, rather than relying solely on larger models. This approach may reduce computational requirements while improving retrieval accuracy and efficiency.
Tzvika Geft, Kostas Bekris, Jingjin Yu
Grid-based storage systems with uniformly shaped loads (e.g., containers, pallets, totes) are commonplace in logistics, industrial, and transportation domains. A key performance metric for such systems is the maximization of space utilization, which requires some loads to be placed behind or below others, preventing direct access to them. Consequently, dense storage settings bring up the challenge of determining how to place loads while minimizing costly rearrangement efforts necessary during retrieval. This paper considers the setting involving an inbound phase, during which loads arrive, followed by an outbound phase, during which loads depart. The setting is prevalent in distribution centers, automated parking garages, and container ports. In both phases, minimizing the number of rearrangement actions results in more optimal (e.g., fast, energy-efficient, etc.) operations. In contrast to previous work focusing on stack-based systems, this effort examines the case where loads can be freely moved along the grid, e.g., by a mobile robot, expanding the range of possible motions. We establish that for a range of scenarios, such as having limited prior knowledge of the loads' arrival sequences or grids with a narrow opening, a (best possible) rearrangement-free solution always exists, including when the loads fill the grid to its capacity. In particular, when the sequences are fully known, we establish an intriguing characterization showing that rearrangement can always be avoided if and only if the open side of the grid (used to access the storage) is at least 3 cells wide. We further discuss useful practical implications of our solutions.