Hongye Xu, Jan Wasilewski, Bartosz Krawczyk
Continual learning in deep neural networks often suffers from catastrophic forgetting, where representations for previous tasks are overwritten during subsequent training. We propose a novel sample retrieval strategy from the memory buffer that leverages both gradient-conflicting and gradient-aligned samples to effectively retain knowledge about past tasks within a supervised contrastive learning framework. Gradient-conflicting samples are selected for their potential to reduce interference by re-aligning gradients, thereby preserving past task knowledge. Meanwhile, gradient-aligned samples are incorporated to reinforce stable, shared representations across tasks. By balancing gradient correction from conflicting samples with alignment reinforcement from aligned ones, our approach increases the diversity among retrieved instances and achieves superior alignment in parameter space, significantly enhancing knowledge retention and mitigating proxy drift. Empirical results demonstrate that using both sample types outperforms methods relying solely on one sample type or random retrieval. Experiments on popular continual learning benchmarks in computer vision validate our method's state-of-the-art performance in mitigating forgetting while maintaining competitive accuracy on new tasks.
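The retrieval idea above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each buffer sample has a flattened gradient vector and that conflict and alignment are scored by cosine similarity with the current-task gradient; the abstract does not specify the scoring rule, and the function name `retrieve_by_gradient` is hypothetical.

```python
import numpy as np

def retrieve_by_gradient(buffer_grads, current_grad, n_conflict, n_aligned):
    """Pick buffer samples whose gradients most conflict with, and most
    align with, the current-task gradient (cosine similarity is an
    assumed scoring rule, not necessarily the authors')."""
    eps = 1e-12  # guards against division by zero for all-zero gradients
    g = current_grad / (np.linalg.norm(current_grad) + eps)
    b = buffer_grads / (np.linalg.norm(buffer_grads, axis=1, keepdims=True) + eps)
    cos = b @ g                       # one cosine similarity per buffer sample
    order = np.argsort(cos)           # ascending: most conflicting first
    conflicting = order[:n_conflict]  # lowest (most negative) similarities
    aligned = order[::-1][:n_aligned] # highest similarities
    return conflicting, aligned, cos

# Toy usage: four buffer samples with 2-D gradients.
buffer_grads = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
current_grad = np.array([1.0, 0.0])
conflicting, aligned, cos = retrieve_by_gradient(buffer_grads, current_grad, 1, 2)
```

If `n_conflict + n_aligned` exceeds the buffer size, the two index sets overlap, so a real implementation would need to deduplicate them.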
Ahmet Yasin Aytar, Kemal Kilic, Kamer Kaya
In the rapidly evolving field of data science, efficiently navigating the expansive body of academic literature is crucial for informed decision-making and innovation. This paper presents an enhanced Retrieval-Augmented Generation (RAG) application, an artificial intelligence (AI)-based system designed to assist data scientists in accessing precise and contextually relevant academic resources. The AI-powered application integrates advanced techniques, including the GeneRation Of BIbliographic Data (GROBID) technique for extracting bibliographic information, fine-tuned embedding models, semantic chunking, and an abstract-first retrieval method, to significantly improve the relevance and accuracy of the retrieved information. This implementation of AI specifically addresses the challenge of academic literature navigation. A comprehensive evaluation using the Retrieval-Augmented Generation Assessment System (RAGAS) framework demonstrates substantial improvements in key metrics, particularly Context Relevance, underscoring the system's effectiveness in reducing information overload and enhancing decision-making processes. Our findings highlight the potential of this enhanced Retrieval-Augmented Generation system to transform academic exploration within data science, ultimately advancing the workflow of research and innovation in the field.
Samantha J Alloo, Kaye S Morgan
X-ray attenuation, phase, and dark-field images provide complementary information. Different experimental techniques can capture these contrast mechanisms, and the corresponding images can be retrieved using various theoretical algorithms. Our previous works developed the Multimodal Intrinsic Speckle-Tracking (MIST) algorithm, which is suitable for multimodal image retrieval from speckle-based X-ray imaging (SBXI) data. MIST is based on the X-ray Fokker-Planck equation, requiring the inversion of derivative operators that are often numerically unstable. These instabilities can be addressed by employing regularization techniques, such as Tikhonov regularization. The regularization output is highly sensitive to the choice of the Tikhonov regularization parameter, making it crucial to select this value carefully and optimally. Here, we present an automated iterative algorithm to optimize the regularization of the inverse Laplacian operator in our most recently published MIST variant, addressing the operator's instability near the Fourier-space origin. Our algorithm leverages the inherent stability of the phase solution obtained from the transport-of-intensity equation for SBXI, using it as a reliable ground truth for the more complex Fokker-Planck-based algorithms that incorporate the dark-field signal. We applied the algorithm to an SBXI dataset collected using synchrotron light of a four-rod sample. The four-rod sample's phase and dark-field images were optimally retrieved using our developed algorithm, eliminating the tedious and subjective task of selecting a suitable Tikhonov regularization parameter. The developed regularization-optimization algorithm makes MIST more user-friendly by eliminating the need for manual parameter selection. We anticipate that our optimization algorithm can also be applied to other image retrieval approaches derived from the Fokker-Planck equation.
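The instability near the Fourier-space origin mentioned above can be seen in a small sketch of a Tikhonov-regularized inverse Laplacian. This is a generic textbook filter, not the MIST implementation, and it does not reproduce the automated parameter optimization; the function name, the `pixel_size` argument, and the default `eps` are assumptions.

```python
import numpy as np

def regularized_inverse_laplacian(f, pixel_size=1.0, eps=1e-6):
    """Apply a Tikhonov-regularized inverse Laplacian in Fourier space.

    The Laplacian multiplies each Fourier component by -k^2, so the exact
    inverse -1/k^2 diverges at k = 0. The regularized filter
    -k^2 / (k^4 + eps) stays finite there (it equals 0 at the origin).
    """
    ny, nx = f.shape
    ky = 2 * np.pi * np.fft.fftfreq(ny, d=pixel_size)
    kx = 2 * np.pi * np.fft.fftfreq(nx, d=pixel_size)
    k2 = kx[None, :] ** 2 + ky[:, None] ** 2
    filt = -k2 / (k2 ** 2 + eps)
    return np.real(np.fft.ifft2(filt * np.fft.fft2(f)))
```

A larger `eps` suppresses low-frequency artifacts more strongly but also damps genuine low-frequency signal, which is why the choice of this parameter matters.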
Qianren Mao, Yangyifei Luo, Qili Zhang, Yashuo Luo, Zhilong Cao, Jinlong Zhang, HanWen Hao, Zhijun Chen et al.
Retrieval-augmented generation (RAG) synergizes the retrieval of pertinent data with the generative capabilities of Large Language Models (LLMs), ensuring that the generated output is not only contextually relevant but also accurate and current. We introduce XRAG, an open-source, modular codebase that facilitates exhaustive evaluation of the performance of foundational components of advanced RAG modules. These components are systematically categorized into four core phases: pre-retrieval, retrieval, post-retrieval, and generation. We systematically analyse them across reconfigured datasets, providing a comprehensive benchmark for their effectiveness. As the complexity of RAG systems continues to escalate, we underscore the critical need to identify potential failure points in RAG systems. We formulate a suite of experimental methodologies and diagnostic testing protocols to dissect the failure points inherent in RAG engineering. Subsequently, we proffer bespoke solutions aimed at bolstering the overall performance of these modules. Our work thoroughly evaluates the performance of advanced core components in RAG systems, providing insights into optimizations for prevalent failure points.
Giacomo Pacini, Fabio Carrara, Nicola Messina, Nicola Tonellotto, Giuseppe Amato, Fabrizio Falchi
Query suggestion, a technique widely adopted in information retrieval,
enhances system interactivity and the browsing experience of document
collections. In cross-modal retrieval, many works have focused on retrieving
relevant items from natural language queries, while few have explored query
suggestion solutions. In this work, we address query suggestion in cross-modal
retrieval, introducing a novel task that focuses on suggesting minimal textual
modifications needed to explore visually consistent subsets of the collection,
following the premise of "Maybe you are looking for". To facilitate the
evaluation and development of methods, we present a tailored benchmark named
CroQS. This dataset comprises initial queries, grouped result sets, and
human-defined suggested queries for each group. We establish dedicated metrics
to rigorously evaluate the performance of various methods on this task,
measuring representativeness, cluster specificity, and similarity of the
suggested queries to the original ones. Baseline methods from related fields,
such as image captioning and content summarization, are adapted for this task
to provide reference performance scores. Although relatively far from human
performance, our experiments reveal that both LLM-based and captioning-based
methods achieve competitive results on CroQS, improving the recall on cluster
specificity by more than 115% and representativeness mAP by more than 52% with
respect to the initial query. The dataset, the implementation of the baseline
methods and the notebooks containing our experiments are available here:
https://paciosoft.com/CroQS-benchmark/
Authors' comments: 15 pages, 5 figures. To be published as full paper in the Proceedings
of the European Conference on Information Retrieval (ECIR) 2025
Zhuoran Jin, Hongbang Yuan, Tianyi Men, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Despite the significant progress made by existing retrieval augmented
language models (RALMs) in providing trustworthy responses and grounding in
reliable sources, they often overlook effective alignment with human
preferences. In the alignment process, reward models (RMs) act as a crucial
proxy for human values to guide optimization. However, it remains unclear how
to evaluate and select a reliable RM for preference alignment in RALMs. To this
end, we propose RAG-RewardBench, the first benchmark for evaluating RMs in RAG
settings. First, we design four crucial and challenging RAG-specific scenarios
to assess RMs, including multi-hop reasoning, fine-grained citation,
appropriate abstention, and conflict robustness. Then, we incorporate 18 RAG
subsets, six retrievers, and 24 RALMs to increase the diversity of data
sources. Finally, we adopt an LLM-as-a-judge approach to improve preference
annotation efficiency and effectiveness, exhibiting a strong correlation with
human annotations. Based on the RAG-RewardBench, we conduct a comprehensive
evaluation of 45 RMs and uncover their limitations in RAG scenarios.
We also reveal that existing trained RALMs show almost no
improvement in preference alignment, highlighting the need for a shift towards
preference-aligned training. We release our benchmark and code publicly at
https://huggingface.co/datasets/jinzhuoran/RAG-RewardBench/ for future work.
Authors' comments: 26 pages, 12 figures, 6 tables
Yunbin Tu, Liang Li, Li Su, Qingming Huang
Video has emerged as a favored multimedia format on the internet. To better
access video content, a new topic, HIREST, has been presented, covering video
retrieval, moment retrieval, moment segmentation, and step-captioning. The
pioneering work chooses a pre-trained CLIP-based model for video retrieval
and leverages it as a feature extractor for the other three challenging tasks,
which are solved in a multi-task learning paradigm. Nevertheless, this work struggles to
learn the comprehensive cognition of user-preferred content, due to
disregarding the hierarchies and association relations across modalities. In
this paper, guided by the shallow-to-deep principle, we propose a query-centric
audio-visual cognition (QUAG) network to construct a reliable multi-modal
representation for moment retrieval, segmentation and step-captioning.
Specifically, we first design the modality-synergistic perception to obtain
rich audio-visual content, by modeling global contrastive alignment and local
fine-grained interaction between visual and audio modalities. Then, we devise
the query-centric cognition that uses the deep-level query to perform the
temporal-channel filtration on the shallow-level audio-visual representation.
This can cognize user-preferred content and thus attain a query-centric
audio-visual representation for the three tasks. Extensive experiments show QUAG
achieves the SOTA results on HIREST. Further, we test QUAG on the query-based
video summarization task and verify its good generalization.
Authors' comments: Accepted by AAAI 2025
Rui Cai, Zhiyu Dong, Jianfeng Dong, Xun Wang
Existing cross-modal retrieval methods typically rely on large-scale
vision-language pair data. This makes it challenging to efficiently develop a
cross-modal retrieval model for under-resourced languages of interest.
Therefore, Cross-lingual Cross-modal Retrieval (CCR), which aims to align
vision and the low-resource language (the target language) without using any
human-labeled target-language data, has gained increasing attention. A
common parameter-efficient solution is to utilize adapter
modules to transfer the vision-language alignment ability of Vision-Language
Pretraining (VLP) models from a source language to a target language. However,
these adapters are usually static once learned, making it difficult to adapt to
target-language captions with varied expressions. To alleviate this, we propose
Dynamic Adapter with Semantics Disentangling (DASD), whose parameters are
dynamically generated conditioned on the characteristics of the input captions.
Considering that the semantics and expression styles of the input caption
largely influence how to encode it, we propose a semantic disentangling module
to extract the semantic-related and semantic-agnostic features from the input,
ensuring that generated adapters are well-suited to the characteristics of
the input caption. Extensive experiments on two image-text datasets and one
video-text dataset demonstrate the effectiveness of our model for cross-lingual
cross-modal retrieval, as well as its good compatibility with various VLP
models.
Authors' comments: Accepted by the 39th AAAI Conference on Artificial Intelligence
(AAAI-25)
Jinhao Jiang, Jiayi Chen, Junyi Li, Ruiyang Ren, Shijie Wang, Wayne Xin Zhao, Yang Song, Tao Zhang
Existing large language models (LLMs) show exceptional problem-solving
capabilities but might struggle with complex reasoning tasks. Despite the
successes of chain-of-thought and tree-based search methods, they mainly depend
on the internal knowledge of LLMs to search over intermediate reasoning steps,
limited to dealing with simple tasks involving fewer reasoning steps. In this
paper, we propose RAG-Star, a novel RAG approach that integrates the
retrieved information to guide the tree-based deliberative reasoning process
that relies on the inherent knowledge of LLMs. By leveraging Monte Carlo Tree
Search, RAG-Star iteratively plans intermediate sub-queries and answers for
reasoning based on the LLM itself. To consolidate internal and external
knowledge, we propose a retrieval-augmented verification that utilizes query-
and answer-aware reward modeling to provide feedback for the inherent reasoning
of LLMs. Our experiments involving Llama-3.1-8B-Instruct and GPT-4o demonstrate
that RAG-Star significantly outperforms previous RAG and reasoning methods.
Authors' comments: LLM;RAG;MCTS
Kanghoon Yoon, Kibum Kim, Jaehyung Jeon, Yeonjun In, Donghyun Kim, Chanyoung Park
Scene Graph Generation (SGG) research has suffered from two fundamental
challenges: the long-tailed predicate distribution and semantic ambiguity
between predicates. These challenges lead to a bias towards head predicates in
SGG models, favoring dominant general predicates while overlooking fine-grained
predicates. In this paper, we address the challenges of SGG by framing it as
a multi-label classification problem with partial annotation, where relevant
labels of fine-grained predicates are missing. Under this new framing, we propose
Retrieval-Augmented Scene Graph Generation (RA-SGG), which identifies potential
instances to be multi-labeled and enriches the single-label with multi-labels
that are semantically similar to the original label by retrieving relevant
samples from our established memory bank. Based on augmented relations (i.e.,
discovered multi-labels), we apply multi-prototype learning to train our SGG
model. Several comprehensive experiments have demonstrated that RA-SGG
outperforms state-of-the-art baselines by up to 3.6% on VG and 5.9% on GQA,
particularly in terms of F@K, showing that RA-SGG effectively alleviates the
issue of biased prediction caused by the long-tailed distribution and semantic
ambiguity of predicates.
Authors' comments: 7 pages
Yun Luo, Yingjie Li, Xiangkun Hu, Qinglin Qi, Fang Guo, Qipeng Guo, Zheng Zhang, Yue Zhang
As online platforms and recommendation algorithms evolve, people are increasingly trapped in echo chambers, leading to biased understandings of various issues. To combat this issue, we have introduced PerSphere, a benchmark designed to facilitate multi-faceted perspective retrieval and summarization, thus breaking free from these information silos. For each query within PerSphere, there are two opposing claims, each supported by distinct, non-overlapping perspectives drawn from one or more documents. Our goal is to accurately summarize these documents, aligning the summaries with the respective claims and their underlying perspectives. This task is structured as a two-step end-to-end pipeline that includes comprehensive document retrieval and multi-faceted summarization. Furthermore, we propose a set of metrics to evaluate the comprehensiveness of the retrieval and summarization content. Experimental results on various counterparts for the pipeline show that recent models struggle with such a complex task. Analysis shows that the main challenge lies in long context and perspective extraction, and we propose a simple but effective multi-agent summarization system, offering a promising solution to enhance performance on PerSphere.
Umer Butt, Stalin Veranasi, Günter Neumann
As the Information Retrieval (IR) field increasingly recognizes the
importance of inclusivity, addressing the needs of low-resource languages
remains a significant challenge. This paper introduces the first large-scale
Urdu IR dataset, created by translating the MS MARCO dataset through machine
translation. We establish baseline results through zero-shot learning for IR in
Urdu and subsequently apply the mMARCO multilingual IR methodology to this
newly translated dataset. Our findings demonstrate that the fine-tuned model
(Urdu-mT5-mMARCO) achieves a Mean Reciprocal Rank (MRR@10) of 0.247 and a
Recall@10 of 0.439, representing significant improvements over zero-shot
results and showing the potential for expanding IR access for Urdu speakers. By
bridging access gaps for speakers of low-resource languages, this work not only
advances multilingual IR research but also emphasizes the ethical and societal
importance of inclusive IR technologies. This work provides valuable insights
into the challenges and solutions for improving language representation and
lays the groundwork for future research, especially in South Asian languages,
which can benefit from the adaptable methods used in this study.
Authors' comments: 7 pages, ECIR 2025, conference camera-ready version
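The two metrics reported above, MRR@10 and Recall@10, can be computed as in the following minimal sketch. The data layout (a dict of ranked document ids per query and a dict of relevant ids per query) and the function names are assumptions for illustration; the sketch also assumes every query has at least one relevant document, and it is not the evaluation script used in the paper.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for qid, ranking in rankings.items():
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)

def recall_at_k(rankings, relevant, k=10):
    """Average fraction of each query's relevant documents found in the top k."""
    total = 0.0
    for qid, ranking in rankings.items():
        hits = len(set(ranking[:k]) & relevant[qid])
        total += hits / len(relevant[qid])
    return total / len(rankings)

# Toy usage: q1 finds its relevant document at rank 2, q2 finds nothing.
rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8", "d7"]}
relevant = {"q1": {"d1"}, "q2": {"d5"}}
mrr = mrr_at_k(rankings, relevant)        # (1/2 + 0) / 2
recall = recall_at_k(rankings, relevant)  # (1 + 0) / 2
```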
Gongbo Zhang, Zihan Xu, Qiao Jin, Fangyi Chen, Yilu Fang, Yi Liu, Justin F. Rousseau, Ziyang Xu et al.
While holding great promise for improving and facilitating healthcare, large language models (LLMs) struggle to produce up-to-date responses on evolving topics due to outdated knowledge or hallucination. Retrieval-augmented generation (RAG) is a pivotal innovation that improves the accuracy and relevance of LLM responses by integrating LLMs with a search engine and external sources of knowledge. However, the quality of RAG responses can be largely impacted by the rank and density of key information in the retrieval results, such as the "lost-in-the-middle" problem. In this work, we aim to improve the robustness and reliability of the RAG workflow in the medical domain. Specifically, we propose a map-reduce strategy, BriefContext, to combat the "lost-in-the-middle" issue without modifying the model weights. We demonstrated the advantage of the workflow with various LLM backbones and on multiple QA datasets. This method promises to improve the safety and reliability of LLMs deployed in healthcare domains.
Yuzheng Cai, Zhenyue Guo, Yiwen Pei, Wanrui Bian, Weiguo Zheng
Recent advancements in large language models (LLMs) have shown impressive versatility across various tasks. To eliminate LLM hallucinations, retrieval-augmented generation (RAG) has emerged as a powerful approach, leveraging external knowledge sources like knowledge graphs (KGs). In this paper, we study the task of KG-driven RAG and propose a novel Similar Graph Enhanced Retrieval-Augmented Generation (SimGRAG) method. It effectively addresses the challenge of aligning query texts and KG structures through a two-stage process: (1) query-to-pattern, which uses an LLM to transform queries into a desired graph pattern, and (2) pattern-to-subgraph, which quantifies the alignment between the pattern and candidate subgraphs using a graph semantic distance (GSD) metric. We also develop an optimized retrieval algorithm that efficiently identifies the top-k subgraphs within 1-second latency on a 10-million-scale KG. Extensive experiments show that SimGRAG outperforms state-of-the-art KG-driven RAG methods in both question answering and fact verification, offering superior plug-and-play usability and scalability.
Robert Litschko, Oliver Kraus, Verena Blaschke, Barbara Plank
A large amount of local and culture-specific knowledge (e.g., people,
traditions, food) can only be found in documents written in dialects. While
there has been extensive research conducted on cross-lingual information
retrieval (CLIR), the field of cross-dialect retrieval (CDIR) has received
limited attention. Dialect retrieval poses unique challenges due to the limited
availability of resources to train retrieval models and the high variability in
non-standardized languages. We study these challenges on the example of German
dialects and introduce the first German dialect retrieval dataset, dubbed
WikiDIR, which consists of seven German dialects extracted from Wikipedia.
Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with
high lexical variation in dialects. We further show that the commonly used
zero-shot cross-lingual transfer approach with multilingual encoders does not
transfer well to extremely low-resource setups, motivating the need for
resource-lean and dialect-specific retrieval models. We finally demonstrate
that (document) translation is an effective way to reduce the dialect gap in
CDIR.
Authors' comments: Accepted at COLING 2025
Elvis Nunez, Luca Zancato, Benjamin Bowman, Aditya Golatkar, Wei Xia, Stefano Soatto
The "state" of State Space Models (SSMs) represents their memory, which fades exponentially over an unbounded span. By contrast, Attention-based models have "eidetic" (i.e., verbatim, or photographic) memory over a finite span (context size). Hybrid architectures combine State Space layers with Attention, but still cannot recall the distant past and can access only the most recent tokens eidetically. Unlike current methods of combining SSM and Attention layers, we allow the state to be allocated based on relevancy rather than recency. In this way, for every new set of query tokens, our models can "eidetically" access tokens from beyond the Attention span of current Hybrid SSMs without requiring extra hardware resources. We introduce a method to expand the memory span of the hybrid state by "reserving" a fraction of the Attention context for tokens retrieved from arbitrarily distant in the past, thus expanding the eidetic memory span of the overall state. We call this reserved fraction of tokens the "expansion span," and the mechanism to retrieve and aggregate it "Span-Expanded Attention" (SE-Attn). To adapt Hybrid models to using SE-Attn, we propose a novel fine-tuning method that extends LoRA to Hybrid models (HyLoRA) and allows efficient adaptation on long spans of tokens. We show that SE-Attn enables us to efficiently adapt pre-trained Hybrid models on sequences of tokens up to 8 times longer than the ones used for pre-training. We show that HyLoRA with SE-Attn is cheaper and more performant than alternatives like LongLoRA when applied to Hybrid models on natural language benchmarks with long-range dependencies, such as PG-19, RULER, and other common natural language downstream tasks.
Ioannis Papadimitriou, Ilias Gialampoukidis, Stefanos Vrochidis, Ioannis Kompatsiaris
We present RAG Playground, an open-source framework for systematic evaluation
of Retrieval-Augmented Generation (RAG) systems. The framework implements and
compares three retrieval approaches: naive vector search, reranking, and hybrid
vector-keyword search, combined with ReAct agents using different prompting
strategies. We introduce a comprehensive evaluation framework with novel
metrics and provide empirical results comparing different language models
(Llama 3.1 and Qwen 2.5) across various retrieval configurations. Our
experiments demonstrate significant performance improvements through hybrid
search methods and structured self-evaluation prompting, achieving up to 72.7%
pass rate on our multi-metric evaluation framework. The results also highlight
the importance of prompt engineering in RAG systems, with our custom-prompted
agents showing consistent improvements in retrieval accuracy and response
quality.
Authors' comments: Work In Progress
Rajat Khanda
Technical troubleshooting in enterprise environments often involves navigating diverse, heterogeneous data sources to resolve complex issues effectively. This paper presents a novel agentic AI solution built on a Weighted Retrieval-Augmented Generation (RAG) Framework tailored for enterprise technical troubleshooting. By dynamically weighting retrieval sources such as product manuals, internal knowledge bases, FAQs, and troubleshooting guides based on query context, the framework prioritizes the most relevant data. For instance, it gives precedence to product manuals for SKU-specific queries while incorporating general FAQs for broader issues. The system employs FAISS for efficient dense vector search, coupled with a dynamic aggregation mechanism to seamlessly integrate results from multiple sources. A Llama-based self-evaluator ensures the contextual accuracy and confidence of the generated responses before delivering them. This iterative cycle of retrieval and validation enhances precision, diversity, and reliability in response generation. Preliminary evaluations on large enterprise datasets demonstrate the framework's efficacy in improving troubleshooting accuracy, reducing resolution times, and adapting to varied technical challenges. Future research aims to enhance the framework by integrating advanced conversational AI capabilities, enabling more interactive and intuitive troubleshooting experiences. Efforts will also focus on refining the dynamic weighting mechanism through reinforcement learning to further optimize the relevance and precision of retrieved information. By incorporating these advancements, the proposed framework is poised to evolve into a comprehensive, autonomous AI solution, redefining technical service workflows across enterprise settings.
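The dynamic weighting of retrieval sources described above might look roughly like the following sketch. Everything specific here is hypothetical: the SKU pattern, the weight values, the source names, and the rule of keeping the best weighted score per document. The abstract states only that sources are weighted by query context and that FAISS performs the dense search, which this sketch replaces with precomputed similarity scores.

```python
import re

def source_weights(query):
    """Return per-source weights for a query (hypothetical rule and values)."""
    # Assumed convention: SKU-specific queries contain a token like "SKU-4821".
    if re.search(r"\bSKU-\d+\b", query, flags=re.IGNORECASE):
        return {"manuals": 0.6, "knowledge_base": 0.2, "faqs": 0.1, "guides": 0.1}
    return {"manuals": 0.2, "knowledge_base": 0.3, "faqs": 0.3, "guides": 0.2}

def aggregate(results_by_source, weights, top_k=3):
    """Merge per-source (doc_id, similarity) hits into one weighted ranking."""
    scored = {}
    for source, hits in results_by_source.items():
        w = weights.get(source, 0.0)  # unknown sources contribute nothing
        for doc_id, score in hits:
            # Keep the best weighted score if a document appears in several sources.
            scored[doc_id] = max(scored.get(doc_id, 0.0), w * score)
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy usage with precomputed similarity scores standing in for a vector search.
results = {"manuals": [("m1", 0.8)], "faqs": [("f1", 0.9)]}
sku_ranking = aggregate(results, source_weights("Reset procedure for SKU-4821"))
general_ranking = aggregate(results, source_weights("How do I reset my password?"))
```

With these assumed weights, the manual entry ranks first for the SKU-specific query and the FAQ entry ranks first for the general one.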
Michael Shen, Muhammad Umar, Kiwan Maeng, G. Edward Suh, Udit Gupta
The rapid increase in the number of parameters in large language models (LLMs) has significantly increased the cost involved in fine-tuning and retraining LLMs, a necessity for keeping models up to date and improving accuracy. Retrieval-Augmented Generation (RAG) offers a promising approach to improving the capabilities and accuracy of LLMs without the necessity of retraining. Although RAG eliminates the need for continuous retraining to update model data, it incurs a trade-off in the form of slower model inference times. As a result, the use of RAG in enhancing the accuracy and capabilities of LLMs often involves diverse performance implications and trade-offs based on its design. In an effort to begin tackling and mitigating the performance penalties associated with RAG from a systems perspective, this paper introduces a detailed taxonomy and characterization of the different elements within the RAG ecosystem for LLMs that explore trade-offs within latency, throughput, and memory. Our study reveals underlying inefficiencies in RAG for systems deployment that can result in time-to-first-token (TTFT) latencies that are twice as long and unoptimized datastores that consume terabytes of storage.
Wenchao Gu, Ensheng Shi, Yanlin Wang, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, Michael R. Lyu
Code retrieval, which retrieves code snippets based on users' natural language descriptions, is widely used by developers and plays a pivotal role in real-world software development. The advent of deep learning has shifted the retrieval paradigm from lexical-based matching towards leveraging deep learning models to encode source code and queries into vector representations, facilitating code retrieval according to vector similarity. Despite the effectiveness of these models, managing a large-scale code database presents significant challenges. Previous research proposes deep hashing-based methods, which generate hash codes for queries and code snippets and use Hamming distance for rapid recall of code candidates. However, this approach's reliance on linear scanning of the entire code base limits its scalability. To further improve the efficiency of large-scale code retrieval, we propose a novel approach, SECRET (Scalable and Efficient Code Retrieval via SegmEnTed deep hashing). SECRET converts long hash codes calculated by existing deep hashing approaches into several short hash code segments through an iterative training strategy. After training, SECRET recalls code candidates by looking up the hash tables for each segment, so the time complexity of recall can be greatly reduced. Extensive experimental results demonstrate that SECRET can drastically reduce the retrieval time by at least 95% while achieving performance comparable to or even higher than that of existing deep hashing approaches. SECRET also exhibits superior performance and efficiency compared to the classical hash table-based approach known as LSH under the same number of hash tables.
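The segment-wise hash table lookup described above can be sketched as follows. This is a simplified illustration, not SECRET itself: SECRET learns its short segments through iterative training, whereas this sketch simply slices a fixed bit string into equal parts. The function names are hypothetical, and the code length is assumed to be divisible by the number of segments.

```python
from collections import defaultdict

def split_segments(code, n_segments):
    """Slice a bit string into n_segments equal-length pieces."""
    seg_len = len(code) // n_segments
    return [code[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

def build_tables(codes, n_segments):
    """Build one hash table per segment position: segment value -> item indices."""
    tables = [defaultdict(set) for _ in range(n_segments)]
    for idx, code in enumerate(codes):
        for table, seg in zip(tables, split_segments(code, n_segments)):
            table[seg].add(idx)
    return tables

def recall_candidates(query_code, tables):
    """Return items sharing at least one exact segment with the query.

    Each lookup is a dictionary access, so recall avoids a linear scan
    over the Hamming distances to every stored code.
    """
    candidates = set()
    for table, seg in zip(tables, split_segments(query_code, len(tables))):
        candidates |= table.get(seg, set())
    return candidates

# Toy usage: three 8-bit codes split into two 4-bit segments.
codes = ["00001111", "00000000", "11111111"]
tables = build_tables(codes, n_segments=2)
candidates = recall_candidates("00001010", tables)
```

The recalled candidates would typically be re-ranked afterwards, for example by full Hamming distance or by the original dense similarity.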