Karl Löwenmark, Daniel Strömbergsson, Chang Liu, Marcus Liwicki, Fredrik Sandin
Condition monitoring (CM) plays a crucial role in ensuring reliability and efficiency in the process industry. Although computerised maintenance systems effectively detect and classify faults, tasks like fault severity estimation, and maintenance decisions still largely depend on human expert analysis. The analysis and decision making automatically performed by current systems typically exhibit considerable uncertainty and high false alarm rates, leading to increased workload and reduced efficiency. This work integrates large language model (LLM)-based reasoning agents with CM workflows to address analyst and industry needs, namely reducing false alarms, enhancing fault severity estimation, improving decision support, and offering explainable interfaces. We propose MindRAG, a modular framework combining multimodal retrieval-augmented generation (RAG) with novel vector store structures designed specifically for CM data. The framework leverages existing annotations and maintenance work orders as surrogates for labels in a supervised learning protocol, addressing the common challenge of training predictive models on unlabelled and noisy real-world datasets. The primary contributions include: (1) an approach for structuring industry CM data into a semi-structured multimodal vector store compatible with LLM-driven workflows; (2) developing multimodal RAG techniques tailored for CM data; (3) developing practical reasoning agents capable of addressing real-world CM queries; and (4) presenting an experimental framework for integrating and evaluating such agents in realistic industrial scenarios. Preliminary results, evaluated with the help of an experienced analyst, indicate that MindRAG provide meaningful decision support for more efficient management of alarms, thereby improving the interpretability of CM systems.
Jialin Chen, Ziyu Zhao, Gaukhar Nurbek, Aosong Feng, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, Rex Ying
The ubiquity of dynamic data in domains such as weather, healthcare, and energy underscores a growing need for effective interpretation and retrieval of time-series data. These data are inherently tied to domain-specific contexts, such as clinical notes or weather narratives, making cross-modal retrieval essential not only for downstream tasks but also for developing robust time-series foundation models by retrieval-augmented generation (RAG). Despite the increasing demand, time-series retrieval remains largely underexplored. Existing methods often lack semantic grounding, struggle to align heterogeneous modalities, and have limited capacity for handling multi-channel signals. To address this gap, we propose TRACE, a generic multimodal retriever that grounds time-series embeddings in aligned textual context. TRACE enables fine-grained channel-level alignment and employs hard negative mining to facilitate semantically meaningful retrieval. It supports flexible cross-modal retrieval modes, including Text-to-Timeseries and Timeseries-to-Text, effectively linking linguistic descriptions with complex temporal patterns. By retrieving semantically relevant pairs, TRACE enriches downstream models with informative context, leading to improved predictive accuracy and interpretability. Beyond a static retrieval engine, TRACE also serves as a powerful standalone encoder, with lightweight task-specific tuning that refines context-aware representations while maintaining strong cross-modal alignment. These representations achieve state-of-the-art performance on downstream forecasting and classification tasks. Extensive experiments across multiple domains highlight its dual utility, as both an effective encoder for downstream applications and a general-purpose retriever to enhance time-series models.
Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, Jinsong Su
Large language models (LLMs) augmented with retrieval systems have
demonstrated significant potential in handling knowledge-intensive tasks.
However, these models often struggle with unfaithfulness issues, generating
outputs that either ignore the retrieved context or inconsistently blend it
with the LLM`s parametric knowledge. This issue is particularly severe in cases
of knowledge conflict, where the retrieved context conflicts with the model`s
parametric knowledge. While existing faithful RAG approaches enforce strict
context adherence through well-designed prompts or modified decoding
strategies, our analysis reveals a critical limitation: they achieve
faithfulness by forcibly suppressing the model`s parametric knowledge, which
undermines the model`s internal knowledge structure and increases the risk of
misinterpreting the context. To this end, this paper proposes FaithfulRAG, a
novel framework that resolves knowledge conflicts by explicitly modeling
discrepancies between the model`s parametric knowledge and retrieved context.
Specifically, FaithfulRAG identifies conflicting knowledge at the fact level
and designs a self-thinking process, allowing LLMs to reason about and
integrate conflicting facts before generating responses. Extensive experiments
demonstrate that our method outperforms state-of-the-art methods. The code is
available at https:// github.com/DeepLearnXMU/Faithful-RAG
Authors' comments: Qinggang Zhang and Zhishang Xiang contributed equally to this work.
Corresponding author: Jinsong Su
Leqi Shen, Guoqiang Gong, Tianxiang Hao, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Jungong Han et al.
The parameter-efficient adaptation of the image-text pretraining model CLIP
for video-text retrieval is a prominent area of research. While CLIP is focused
on image-level vision-language matching, video-text retrieval demands
comprehensive understanding at the video level. Three key discrepancies emerge
in the transfer from image-level to video-level: vision, language, and
alignment. However, existing methods mainly focus on vision while neglecting
language and alignment. In this paper, we propose Discrepancy Reduction in
Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all
three discrepancies. Specifically, we introduce Image-Video Features Fusion to
integrate image-level and video-level features, effectively tackling both
vision and language discrepancies. Additionally, we generate pseudo image
captions to learn fine-grained image-level alignment. To mitigate alignment
discrepancies, we propose Image-to-Video Alignment Distillation, which
leverages image-level alignment knowledge to enhance video-level alignment.
Extensive experiments demonstrate the superiority of our DiscoVLA. In
particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous
methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is
available at https://github.com/LunarShen/DsicoVLA.
Authors' comments: CVPR 2025
Minhae Oh, Jeonghye Kim, Nakyung Lee, Donggeon Seo, Taeuk Kim, Jungwoo Lee
Scientific reasoning requires not only long-chain reasoning processes, but also knowledge of domain-specific terminologies and adaptation to updated findings. To deal with these challenges for scientific reasoning, we introduce RAISE, a step-by-step retrieval-augmented framework which retrieves logically relevant documents from in-the-wild corpus. RAISE is divided into three steps: problem decomposition, logical query generation, and logical retrieval. We observe that RAISE consistently outperforms other baselines on scientific reasoning benchmarks. We analyze that unlike other baselines, RAISE retrieves documents that are not only similar in terms of the domain knowledge, but also documents logically more relevant.
Liyan Xu, Zhenlin Su, Mo Yu, Jiangnan Li, Fandong Meng, Jie Zhou
This work focuses on an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within the semantics, resulting in failed dense retrieval on even simple cases. To examine such behaviors, we first introduce a new evaluation dataset in Chinese, named CapRetrieval, whose passages are image captions, and queries are phrases inquiring entities or events in various forms. Zero-shot evaluation suggests that encoders may fail on these fine-grained matching, regardless of training sources or model sizes. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, which obtains the best performance on CapRetrieval. Within this process, we further identify an issue of granularity dilemma, a challenge for embeddings to express fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.
Anvi Alex Eponon, Moein Shahiki-Tash, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov, Alexander Gelbukh
This study presents a question-based knowledge encoding approach that improves retrieval-augmented generation (RAG) systems without requiring fine-tuning or traditional chunking. We encode textual content using generated questions that span the lexical and semantic space, creating targeted retrieval cues combined with a custom syntactic reranking method. In single-hop retrieval over 109 scientific papers, our approach achieves a Recall@3 of 0.84, outperforming traditional chunking methods by 60 percent. We also introduce "paper-cards", concise paper summaries under 300 characters, which enhance BM25 retrieval, increasing MRR@3 from 0.56 to 0.85 on simplified technical queries. For multihop tasks, our reranking method reaches an F1 score of 0.52 with LLaMA2-Chat-7B on the LongBench 2WikiMultihopQA dataset, surpassing chunking and fine-tuned baselines which score 0.328 and 0.412 respectively. This method eliminates fine-tuning requirements, reduces retrieval latency, enables intuitive question-driven knowledge access, and decreases vector storage demands by 80%, positioning it as a scalable and efficient RAG alternative.
Zhongze Luo, Weixuan Wan, Qizhi Zheng, Yanhong Bai, Jingyun Sun, Jian Wang, Dan Wang
There are many types of standards in the field of communication. The
traditional consulting model has a long cycle and relies on the knowledge and
experience of experts, making it difficult to meet the rapidly developing
technological demands. This paper combines the fine-tuning of large language
models with the construction of knowledge graphs to implement an intelligent
consultation and question-answering system for communication standards. The
experimental results show that after LoRA tuning on the constructed dataset of
6,587 questions and answers in the field of communication standards,
Qwen2.5-7B-Instruct demonstrates outstanding professional capabilities in the
field of communication standards on the test set. BLEU-4 rose from 18.8564 to
66.8993, and evaluation indicators such as ROUGE also increased significantly,
outperforming the fine-tuning effect of the comparison model
Llama-3-8B-Instruct. Based on the ontology framework containing 6 entity
attributes and 10 relation attributes, a knowledge graph of the communication
standard domain containing 13,906 entities and 13,524 relations was
constructed, showing a relatively good query accuracy rate. The intelligent
consultation and question-answering system enables the fine-tuned model on the
server side to access the locally constructed knowledge graph and conduct
graphical retrieval of key information first, which is conducive to improving
the question-answering effect. The evaluation using DeepSeek as the Judge on
the test set shows that our RAG framework enables the fine-tuned model to
improve the scores at all five angles, with an average score increase of 2.26%.
And combined with web services and API interfaces, it has achieved very good
results in terms of interaction experience and back-end access, and has very
good practical application value.
Authors' comments: 23 pages
Yu Li, Xingyu Qiu, Yuqian Fu, Jie Chen, Tianwen Qian, Xu Zheng, Danda Pani Paudel, Yanwei Fu et al.
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Codes will be released upon acceptance.
Ze Yu Zhang, Zitao Li, Yaliang Li, Bolin Ding, Bryan Kian Hsiang Low
Retrieval-augmented generation (RAG) based on large language models often
falters on narrative documents with inherent temporal structures. Standard
unstructured RAG methods rely solely on embedding-similarity matching and lack
any general mechanism to encode or exploit chronological information, while
knowledge graph RAG (KG-RAG) frameworks collapse every mention of an entity
into a single node, erasing the evolving context that drives many queries. To
formalize this challenge and draw the community's attention, we construct
ChronoQA, a robust and discriminative QA benchmark that measures temporal,
causal, and character consistency understanding in narrative documents (e.g.,
novels) under the RAG setting. We then introduce Entity-Event RAG (E^2RAG), a
dual-graph framework that keeps separate entity and event subgraphs linked by a
bipartite mapping, thereby preserving the temporal and causal facets needed for
fine-grained reasoning. Across ChronoQA, our approach outperforms
state-of-the-art unstructured and KG-based RAG baselines, with notable gains on
causal and character consistency queries. E^2RAG therefore offers a practical
path to more context-aware retrieval for tasks that require precise answers
grounded in chronological information.
Authors' comments: 24 pages, 4 figures
Haowei Wang, Rupeng Zhang, Junjie Wang, Mingyang Li, Yuekai Huang, Dandan Wang, Qing Wang
Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by retrieving relevant documents from external corpora before generating responses. This approach significantly expands LLM capabilities by leveraging vast, up-to-date external knowledge. However, this reliance on external knowledge makes RAG systems vulnerable to corpus poisoning attacks that manipulate generated outputs via poisoned document injection. Existing poisoning attack strategies typically treat the retrieval and generation stages as disjointed, limiting their effectiveness. We propose Joint-GCG, the first framework to unify gradient-based attacks across both retriever and generator models through three innovations: (1) Cross-Vocabulary Projection for aligning embedding spaces, (2) Gradient Tokenization Alignment for synchronizing token-level gradient signals, and (3) Adaptive Weighted Fusion for dynamically balancing attacking objectives. Evaluations demonstrate that Joint-GCG achieves at most 25% and an average of 5% higher attack success rate than previous methods across multiple retrievers and generators. While optimized under a white-box assumption, the generated poisons show unprecedented transferability to unseen models. Joint-GCG's innovative unification of gradient-based attacks across retrieval and generation stages fundamentally reshapes our understanding of vulnerabilities within RAG systems. Our code is available at https://github.com/NicerWang/Joint-GCG.
Animesh Gupta, Jay Parmar, Ishan Rajendrakumar Dave, Mubarak Shah
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks focusing on temporal aspect, link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 25.82.
Chenyu Lin, Yilin Wen, Du Su, Fei Sun, Muhan Chen, Chenfu Bao, Zhonghou Lv
Retrieval-augmented generation (RAG) is a mainstream method for improving performance on knowledge-intensive tasks. However,current RAG systems often place too much emphasis on retrieved contexts. This can lead to reliance on inaccurate sources and overlook the model's inherent knowledge, especially when dealing with misleading or excessive information. To resolve this imbalance, we propose Knowledgeable-r1 that using joint sampling and define multi policy distributions in knowledge capability exploration to stimulate large language models'self-integrated utilization of parametric and contextual knowledge. Experiments show that Knowledgeable-r1 significantly enhances robustness and reasoning accuracy in both parameters and contextual conflict tasks and general RAG tasks, especially outperforming baselines by 17.07% in counterfactual scenarios and demonstrating consistent gains across RAG tasks. Our code are available at https://github.com/lcy80366872/ knowledgeable-r1.
Gengluo Li, Huawen Shen, Yu Zhou
Chinese scene text retrieval is a practical task that aims to search for images containing visual instances of a Chinese query text. This task is extremely challenging because Chinese text often features complex and diverse layouts in real-world scenes. Current efforts tend to inherit the solution for English scene text retrieval, failing to achieve satisfactory performance. In this paper, we establish a Diversified Layout benchmark for Chinese Street View Text Retrieval (DL-CSVTR), which is specifically designed to evaluate retrieval performance across various text layouts, including vertical, cross-line, and partial alignments. To address the limitations in existing methods, we propose Chinese Scene Text Retrieval CLIP (CSTR-CLIP), a novel model that integrates global visual information with multi-granularity alignment training. CSTR-CLIP applies a two-stage training process to overcome previous limitations, such as the exclusion of visual features outside the text region and reliance on single-granularity alignment, thereby enabling the model to effectively handle diverse text layouts. Experiments on existing benchmark show that CSTR-CLIP outperforms the previous state-of-the-art model by 18.82% accuracy and also provides faster inference speed. Further analysis on DL-CSVTR confirms the superior performance of CSTR-CLIP in handling various text layouts. The dataset and code will be publicly available to facilitate research in Chinese scene text retrieval.
Yubo Ma, Jinsong Li, Yuhang Zang, Xiaobao Wu, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan et al.
Despite the strong performance of ColPali/ColQwen2 in Visualized Document
Retrieval (VDR), it encodes each page into multiple patch-level embeddings and
leads to excessive memory usage. This empirical study investigates methods to
reduce patch embeddings per page at minimum performance degradation. We
evaluate two token-reduction strategies: token pruning and token merging.
Regarding token pruning, we surprisingly observe that a simple random strategy
outperforms other sophisticated pruning methods, though still far from
satisfactory. Further analysis reveals that pruning is inherently unsuitable
for VDR as it requires removing certain page embeddings without query-specific
information. Turning to token merging (more suitable for VDR), we search for
the optimal combinations of merging strategy across three dimensions and
develop Light-ColPali/ColQwen2. It maintains 98.2% of retrieval performance
with only 11.8% of original memory usage, and preserves 94.6% effectiveness at
2.8% memory footprint. We expect our empirical findings and resulting
Light-ColPali/ColQwen2 offer valuable insights and establish a competitive
baseline for future research towards efficient VDR.
Authors' comments: Accepted by ACL 2025 findings
Hong Gao, Yiming Bao, Xuezhan Tu, Bin Zhong, Minling Zhang
Current video-based multimodal large language models struggle with hour-level video understanding due to computational constraints and inefficient information extraction from extensive temporal sequences. We propose APVR (Adaptive Pivot Visual information Retrieval), a training-free framework that addresses the memory wall limitation through hierarchical visual information retrieval. APVR operates via two complementary components: Pivot Frame Retrieval employs semantic expansion and multi-modal confidence scoring to identify semantically relevant video frames, while Pivot Token Retrieval performs query-aware attention-driven token selection within the pivot frames. This dual granularity approach enables processing of hour-long videos while maintaining semantic fidelity. Experimental validation on LongVideoBench and VideoMME demonstrates significant performance improvements, establishing state-of-the-art results for not only training-free but also training-based approaches while providing plug-and-play integration capability with existing MLLM architectures.
Lingyuan Liu, Mengxiang Zhang
Large language models (LLMs)-based query expansion for information retrieval augments queries with generated hypothetical documents with LLMs. However, its performance relies heavily on the scale of the language models (LMs), necessitating larger, more advanced LLMs. This approach is costly, computationally intensive, and often has limited accessibility. To address these limitations, we introduce GOLFer - Smaller LMs-Generated Documents Hallucination Filter & Combiner - a novel method leveraging smaller open-source LMs for query expansion. GOLFer comprises two modules: a hallucination filter and a documents combiner. The former detects and removes non-factual and inconsistent sentences in generated documents, a common issue with smaller LMs, while the latter combines the filtered content with the query using a weight vector to balance their influence. We evaluate GOLFer alongside dominant LLM-based query expansion methods on three web search and ten low-resource datasets. Experimental results demonstrate that GOLFer consistently outperforms other methods using smaller LMs, and maintains competitive performance against methods using large-size LLMs, demonstrating its effectiveness.
Lingyuan Liu, Mengxiang Zhang
Large Language Models (LLMs) have shown potential in generating hypothetical documents for query expansion, thereby enhancing information retrieval performance. However, the efficacy of this method is highly dependent on the quality of the generated documents, which often requires complex prompt strategies and the integration of advanced dense retrieval techniques. This can be both costly and computationally intensive. To mitigate these limitations, we explore the use of zero-shot LLM-based query expansion to improve sparse retrieval, particularly for learned sparse retrievers. We introduce a novel fusion ranking framework, Exp4Fuse, which enhances the performance of sparse retrievers through an indirect application of zero-shot LLM-based query expansion. Exp4Fuse operates by simultaneously considering two retrieval routes-one based on the original query and the other on the LLM-augmented query. It then generates two ranked lists using a sparse retriever and fuses them using a modified reciprocal rank fusion method. We conduct extensive evaluations of Exp4Fuse against leading LLM-based query expansion methods and advanced retrieval techniques on three MS MARCO-related datasets and seven low-resource datasets. Experimental results reveal that Exp4Fuse not only surpasses existing LLM-based query expansion methods in enhancing sparse retrievers but also, when combined with advanced sparse retrievers, achieves SOTA results on several benchmarks. This highlights the superior performance and effectiveness of Exp4Fuse in improving query expansion for sparse retrieval.
Hongjun Liu, Yilun Zhao, Arman Cohan, Chen Zhao
Automatic fact-checking has recently received more attention as a means of
combating misinformation. Despite significant advancements, fact-checking
systems based on retrieval-augmented language models still struggle to tackle
adversarial claims, which are intentionally designed by humans to challenge
fact-checking systems. To address these challenges, we propose a training-free
method designed to rephrase the original claim, making it easier to locate
supporting evidence. Our modular framework, SUCEA, decomposes the task into
three steps: 1) Claim Segmentation and Decontextualization that segments
adversarial claims into independent sub-claims; 2) Iterative Evidence Retrieval
and Claim Editing that iteratively retrieves evidence and edits the subclaim
based on the retrieved evidence; 3) Evidence Aggregation and Label Prediction
that aggregates all retrieved evidence and predicts the entailment label.
Experiments on two challenging fact-checking datasets demonstrate that our
framework significantly improves on both retrieval and entailment label
accuracy, outperforming four strong claim-decomposition-based baselines.
Authors' comments: 16 pages, 10 figures, 7 tables
Haiye Huo, Li Xiao
Sparse phase retrieval with redundant dictionary is to reconstruct the
signals of interest that are (nearly) sparse in a redundant dictionary or frame
from the phaseless measurements via the optimization models. Gao [7] presented
conditions on the measurement matrix, called null space property (NSP) and
strong dictionary restricted isometry property (S-DRIP), for exact and stable
recovery of dictionary-$k$-sparse signals via the $\ell_1$-analysis model for
sparse phase retrieval with redundant dictionary, respectively, where, in
particularly, the S-DRIP of order $tk$ with $t>1$ was derived. In this paper,
motivated by many advantages of the $\ell_q$ minimization with $0<q\leq1$,
e.g., reduction of the number of measurements required, we generalize these two
conditions to the $\ell_q$-analysis model. Specifically, we first present two
NSP variants for exact recovery of dictionary-$k$-sparse signals via the
$\ell_q$-analysis model in the noiseless scenario. Moreover, we investigate the
S-DRIP of order $tk$ with $0<t<\frac{4}{3}$ for stable recovery of
dictionary-$k$-sparse signals via the $\ell_q$-analysis model in the noisy
scenario, which will complement the existing result of the S-DRIP of order $tk$
with $t\geq2$ obtained in [4].
Authors' comments: 21 Pages