Guangyuan Liu, Yinqiu Liu, Ruichen Zhang, Hongyang Du, Dusit Niyato, Zehui Xiong, Sumei Sun, Abbas Jamalipour
The rapid development of multimodal AI and Large Language Models (LLMs) has greatly enhanced real-time interaction, decision-making, and collaborative tasks. However, in wireless multi-agent scenarios, limited bandwidth poses significant challenges to exchanging semantically rich multimodal information efficiently. Traditional semantic communication methods, though effective, struggle with redundancy and loss of crucial details. To overcome these challenges, we propose a Retrieval-Augmented Multimodal Semantic Communication (RAMSemCom) framework. RAMSemCom incorporates iterative, retrieval-driven semantic refinement tailored for distributed multi-agent environments, enabling efficient exchange of critical multimodal elements through local caching and selective transmission. Our approach dynamically optimizes retrieval using deep reinforcement learning (DRL) to balance semantic fidelity with bandwidth constraints. A comprehensive case study on multi-agent autonomous driving demonstrates that our DRL-based retrieval strategy significantly improves task completion efficiency and reduces communication overhead compared to baseline methods.
Thushara Manjari Naduvilakandy, Hyeju Jang, Mohammad Al Hasan
Causality detection and mining are important tasks in information retrieval
due to their extensive use in information extraction and knowledge graph
construction. The existing literature offers several solutions to these tasks,
both unsupervised and supervised. However, the unsupervised
methods suffer from poor performance and often require significant human
intervention for causal rule selection, leading to poor generalization across
different domains. Supervised methods, on the other hand, suffer from the lack
of large training datasets. Recently, large language models (LLMs) with
effective prompt engineering have been found to overcome the
unavailability of large training datasets. Yet the existing literature
lacks comprehensive work on causality detection and mining using LLM
prompting. In this paper, we present several retrieval-augmented generation
(RAG) based dynamic prompting schemes to enhance LLM performance in causality
detection and extraction tasks. Extensive experiments over three datasets and
five LLMs validate the superiority of our proposed RAG-based dynamic prompting
over other static prompting schemes.
Authors' comments: 13 pages, 6 figures, published in knowledgeNLP-NAACL2025
Adriano Fragomeni, Dima Damen, Michael Wray
Text-to-Video (T2V) retrieval aims to identify the most relevant item from a gallery of videos based on a user's text query. Traditional methods rely solely on aligning video and text modalities to compute the similarity and retrieve relevant items. However, recent advancements emphasise incorporating auxiliary information extracted from video and text modalities to improve retrieval performance and bridge the semantic gap between these modalities. Auxiliary information can include visual attributes, such as objects; temporal and spatial context; and textual descriptions, such as speech and rephrased captions. This survey comprehensively reviews 81 research papers on Text-to-Video retrieval that utilise such auxiliary information. It provides a detailed analysis of their methodologies; highlights state-of-the-art results on benchmark datasets; and discusses available datasets and their auxiliary information. Additionally, it proposes promising directions for future research, focusing on different ways to further enhance retrieval performance using this information.
Michael Grohs, Adrian Rebmann, Jana-Rebecca Rehse
Conformance checking techniques detect undesired process behavior by
comparing process executions that are recorded in event logs to desired
behavior that is captured in a dedicated process model. If such models are not
available, conformance checking techniques are not applicable, but
organizations might still be interested in detecting undesired behavior in
their processes. To enable this, existing approaches use Large Language Models
(LLMs), assuming that they can learn to distinguish desired from undesired
behavior through fine-tuning. However, fine-tuning is highly resource-intensive
and the fine-tuned LLMs often do not generalize well. To address these
limitations, we propose an approach that requires neither a dedicated process
model nor resource-intensive fine-tuning to detect undesired process behavior.
Instead, we use Retrieval Augmented Generation (RAG) to provide an LLM with
direct access to a knowledge base that contains both desired and undesired
process behavior from other processes, assuming that the LLM can transfer this
knowledge to the process at hand. Our evaluation shows that our approach
outperforms fine-tuned LLMs in detecting undesired behavior, demonstrating that
RAG is a viable alternative to resource-intensive fine-tuning, particularly
when enriched with relevant context from the event log, such as frequent traces
and activities.
Authors' comments: Accepted at the BPM Forum, located at the International Conference on
Business Process Management (BPM) 2025
Shuyang Cao, Karthik Radhakrishnan, David Rosenberg, Steven Lu, Pengxiang Cheng, Lu Wang, Shiyue Zhang
Retrieval-augmented generation (RAG) generally enhances large language
models' (LLMs) ability to solve knowledge-intensive tasks. But RAG may also
lead to performance degradation due to imperfect retrieval and the model's
limited ability to leverage retrieved content. In this work, we evaluate the
robustness of LLMs in practical RAG setups (henceforth retrieval robustness).
We focus on three research questions: (1) whether RAG is always better than
non-RAG; (2) whether more retrieved documents always lead to better
performance; and (3) whether document order impacts results. To facilitate this
study, we establish a benchmark of 1500 open-domain questions, each with
retrieved documents from Wikipedia. We introduce three robustness metrics, each
corresponding to one research question. Our comprehensive experiments, involving
11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit
surprisingly high retrieval robustness; nonetheless, different degrees of
imperfect robustness hinder them from fully utilizing the benefits of RAG.
Authors' comments: 19 pages
Chaeeun Kim, Jinu Lee, Wonseok Hwang
Legal Case Retrieval (LCR), which retrieves relevant cases from a query case,
is a fundamental task for legal professionals in research and decision-making.
However, existing studies on LCR face two major limitations. First, they are
evaluated on relatively small-scale retrieval corpora (e.g., 100-55K cases) and
use a narrow range of criminal query types, which cannot sufficiently reflect
the complexity of real-world legal retrieval scenarios. Second, their reliance
on embedding-based or lexical matching methods often results in limited
representations and legally irrelevant matches. To address these issues, we
present: (1) LEGAR BENCH, the first large-scale Korean LCR benchmark, covering
411 diverse crime types in queries over 1.2M legal cases; and (2)
LegalSearchLM, a retrieval model that performs legal element reasoning over the
query case and directly generates content grounded in the target cases through
constrained decoding. Experimental results show that LegalSearchLM outperforms
baselines by 6-20% on LEGAR BENCH, achieving state-of-the-art performance. It
also demonstrates strong generalization to out-of-domain cases, outperforming
naive generative models trained on in-domain data by 15%.
Authors' comments: Under review
Yunyi Zhang, Ruozhen Yang, Siqi Jiao, SeongKu Kang, Jiawei Han
Scientific paper retrieval is essential for supporting literature discovery and research. While dense retrieval methods demonstrate effectiveness in general-purpose tasks, they often fail to capture fine-grained scientific concepts that are essential for accurate understanding of scientific queries. Recent studies also use large language models (LLMs) for query understanding; however, these methods often lack grounding in corpus-specific knowledge and may generate unreliable or unfaithful content. To overcome these limitations, we propose SemRank, an effective and efficient paper retrieval framework that combines LLM-guided query understanding with a concept-based semantic index. Each paper is indexed using multi-granular scientific concepts, including general research topics and detailed key phrases. At query time, an LLM identifies core concepts derived from the corpus to explicitly capture the query's information need. These identified concepts enable precise semantic matching, significantly enhancing retrieval accuracy. Experiments show that SemRank consistently improves the performance of various base retrievers, surpasses strong existing LLM-based baselines, and remains highly efficient.
Qifeng Wu, Zhengzhe Liu, Han Zhu, Yizhou Zhao, Daisuke Kihara, Min Xu
This paper aims to retrieve proteins with similar structures and semantics
from large-scale protein datasets, facilitating the functional interpretation of
protein structures derived by structural determination methods like
cryo-Electron Microscopy (cryo-EM). Motivated by the recent progress of
vision-language models (VLMs), we propose a CLIP-style framework for aligning
3D protein structures with functional annotations using contrastive learning.
For model training, we propose a large-scale dataset of approximately 200,000
protein-caption pairs with rich functional descriptors. We evaluate our model
in both in-domain and more challenging cross-database retrieval on Protein Data
Bank (PDB) and Electron Microscopy Data Bank (EMDB) datasets, respectively. In
both cases, our approach demonstrates promising zero-shot retrieval
performance, highlighting the potential of multimodal foundation models for
structure-function understanding in protein biology.
Authors' comments: 4 pages for body, 3 pages for appendix, 11 figures. Accepted to CVPR
2025 Workshop on Multimodal Foundation Models for Biomedicine: Challenges and
Opportunities (MMFM-BIOMED)
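The CLIP-style alignment described in the abstract above is usually trained with a symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. The following is a minimal sketch of that generic loss, not the authors' implementation: the function name `clip_style_loss`, the temperature of 0.07, and the toy two-dimensional embeddings are all assumptions made for illustration, and the encoders that would produce the structure and caption embeddings are omitted.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_style_loss(struct_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair."""
    s = [_normalize(v) for v in struct_emb]
    t = [_normalize(v) for v in text_emb]
    n = len(s)
    # Cosine similarity between every structure and every caption, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(s[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Structure-to-text direction: each row should put its mass on the diagonal.
    loss_s2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-structure direction: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2s = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)


# Toy check: correctly paired embeddings should give a lower loss than swapped ones.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

At retrieval time, a model trained this way would rank captions (or structures) by the same cosine similarity, which is what makes zero-shot retrieval possible without task-specific heads.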
Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, Haifeng Wang
Long-form question answering (LFQA) presents unique challenges for large language models, requiring the synthesis of coherent, paragraph-length answers. While retrieval-augmented generation (RAG) systems have emerged as a promising solution, existing research struggles with key limitations: the scarcity of high-quality training data for long-form generation, the compounding risk of hallucination in extended outputs, and the absence of reliable evaluation metrics for factual completeness. In this paper, we propose RioRAG, a novel reinforcement learning (RL) framework that advances long-form RAG through reinforced informativeness optimization. Our approach introduces two fundamental innovations to address the core challenges. First, we develop an RL training paradigm of reinforced informativeness optimization that directly optimizes informativeness and effectively addresses the slow-thinking deficit in conventional RAG systems, bypassing the need for expensive supervised data. Second, we propose a nugget-centric hierarchical reward modeling approach that enables precise assessment of long-form answers through a three-stage process: extracting the nugget from every source webpage, constructing a nugget claim checklist, and computing rewards based on factual alignment. Extensive experiments on two LFQA benchmarks, LongFact and RAGChecker, demonstrate the effectiveness of the proposed method. Our code is available at https://github.com/RUCAIBox/RioRAG.
Huiyao Chen, Yi Yang, Yinghui Li, Meishan Zhang, Min Zhang
Long document understanding has become increasingly crucial in natural
language processing, with retrieval-based methods emerging as a promising
solution to address the context length limitations of large language models
(LLMs). However, existing approaches either treat documents as flat sequences
or employ arbitrary chunking strategies, failing to capture the inherent
discourse structure that guides human comprehension. We present DISRetrieval, a
novel hierarchical retrieval framework that leverages linguistic discourse
structure to enhance long document understanding. Our approach introduces three
key innovations: (1) a discourse-aware document organization framework that
utilizes rhetorical structure theory (RST) to create sentence-level
hierarchical representations, preserving both semantic relationships and
natural document flow; (2) an LLM-enhanced node representation technique that
combines discourse structure with adaptive summarization to enrich tree nodes
with contextual information; and (3) a hierarchical evidence retrieval
mechanism that effectively selects relevant content while maintaining discourse
coherence. Through comprehensive experiments on QASPER and QuALITY datasets,
DISRetrieval demonstrates substantial improvements over existing methods in
both token-level retrieval metrics and downstream question answering tasks. Our
ablation studies confirm that incorporating discourse structure significantly
enhances retrieval effectiveness across different document lengths and query
types, validating the importance of linguistically-informed document
representation in long-text understanding. Our code and datasets are publicly
available at github/DreamH1gh/DISRetrieval to facilitate future research.
Authors' comments: 21 pages, 7 figures
Yanzhen Shen, Sihao Chen, Xueqiang Xu, Yunyi Zhang, Chaitanya Malaviya, Dan Roth
While significant progress has been made with dual- and bi-encoder dense retrievers, they often struggle on queries with logical connectives, a use case that is often overlooked yet important in downstream applications: the retrieved results frequently do not respect the logical constraints implied in the queries. To address this challenge, we introduce LogiCoL, a logically-informed contrastive learning objective for dense retrievers. LogiCoL builds upon in-batch supervised contrastive learning, and trains dense retrievers to respect the subset and mutually-exclusive set relations between query results via two sets of soft constraints expressed via t-norm in the learning objective. We evaluate the effectiveness of LogiCoL on the task of entity retrieval, where the model is expected to retrieve a set of entities in Wikipedia that satisfy the implicit logical constraints in the query. We show that models trained with LogiCoL yield improvements both in terms of retrieval performance and logical consistency in the results. We provide detailed analysis and insights to uncover why queries with logical connectives are challenging for dense retrievers and why LogiCoL is most effective.
Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan et al.
In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model's performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data can be seen at https://github.com/IDEA-XL/RAPM.
Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He et al.
Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tuning achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset, comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).
Kidist Amde Mekonnen, Yosef Worku Alemneh, Maarten de Rijke
Neural retrieval methods using transformer-based pre-trained language models
have advanced multilingual and cross-lingual retrieval. However, their
effectiveness for low-resource, morphologically rich languages such as Amharic
remains underexplored due to data scarcity and suboptimal tokenization. We
address this gap by introducing Amharic-specific dense retrieval models based
on pre-trained Amharic BERT and RoBERTa backbones. Our proposed
RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative
improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest
multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact
variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while
being over 13x smaller. Additionally, we train a ColBERT-based late interaction
retrieval model that achieves the highest MRR@10 score (0.843) among all
evaluated models. We benchmark our proposed models against both sparse and
dense retrieval baselines to systematically assess retrieval effectiveness in
Amharic. Our analysis highlights key challenges in low-resource settings and
underscores the importance of language-specific adaptation. To foster future
research in low-resource IR, we publicly release our dataset, codebase, and
trained models at https://github.com/kidist-amde/amharic-ir-benchmarks.
Authors' comments: 10 pages (excluding references and appendix), 10 figures. Accepted to
ACL 2025 Findings. Public release includes dataset, code, and trained models:
https://github.com/kidist-amde/amharic-ir-benchmarks
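The MRR@10 and Recall@10 figures quoted in the abstract above follow the standard IR definitions, and the 17.6% figure is a relative (not absolute) improvement. A minimal sketch of how these quantities are commonly computed is given below; the function names and the toy rankings are illustrative assumptions and are not taken from the released codebase.

```python
def mrr_at_k(ranked, relevant, k=10):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for qid, docs in ranked.items():
        for rank, doc in enumerate(docs[:k], start=1):
            if doc in relevant[qid]:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked)


def recall_at_k(ranked, relevant, k=10):
    """Mean fraction of each query's relevant documents found in the top k.

    Assumes every query has at least one relevant document.
    """
    total = 0.0
    for qid, docs in ranked.items():
        rel = relevant[qid]
        total += len(rel & set(docs[:k])) / len(rel)
    return total / len(ranked)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.176 for a 17.6% gain."""
    return (new - baseline) / baseline


# Hypothetical toy data: two queries with system rankings and gold labels.
ranked = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d9", "d5"}}

mrr = mrr_at_k(ranked, relevant)        # q1 contributes 1/2, q2 contributes 1/1
recall = recall_at_k(ranked, relevant)  # q1 contributes 1/1, q2 contributes 1/2
```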
Zirui Li, Siwei Wu, Xingyu Wang, Yi Zhou, Yizhi Li, Chenghua Lin
The rapid advancement of unsupervised representation learning and large-scale
pre-trained vision-language models has significantly improved cross-modal
retrieval tasks. However, existing multi-modal information retrieval (MMIR)
studies lack a comprehensive exploration of document-level retrieval and suffer
from the absence of cross-domain datasets at this granularity. To address this
limitation, we introduce DocMMIR, a novel multi-modal document retrieval
framework designed explicitly to unify diverse document formats and domains,
including Wikipedia articles, scientific papers (arXiv), and presentation
slides, within a comprehensive retrieval scenario. We construct a large-scale
cross-domain multimodal benchmark, comprising 450K samples, which
systematically integrates textual and visual information. Our comprehensive
experimental analysis reveals substantial limitations in current
state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our
tasks, with only CLIP demonstrating reasonable zero-shot performance.
Furthermore, we conduct a systematic investigation of training strategies,
including cross-modal fusion methods and loss functions, and develop a tailored
approach to train CLIP on our benchmark. This results in a +31% improvement in
MRR@10 compared to the zero-shot baseline. All our data and code are released
in https://github.com/J1mL1/DocMMIR.
Authors' comments: 13 pages, 7 figures. Code and data publicly available at
https://github.com/J1mL1/DocMMIR
Robin D. Pesl, Jerin G. Mathew, Massimo Mecella, Marco Aiello
Integrating multiple (sub-)systems is essential to create advanced
Information Systems. Difficulties mainly arise when integrating dynamic
environments, e.g., the integration at design time of not yet existing
services. This has been traditionally addressed using a registry that provides
the API documentation of the endpoints. Large Language Models have been shown to be
capable of automatically creating system integrations (e.g., as service
composition) based on this documentation but require concise input due to input
token limitations, especially regarding comprehensive API descriptions.
Currently, it is unknown how best to preprocess these API descriptions. In the
present work, we (i) analyze the usage of Retrieval Augmented Generation for
endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice
OpenAPIs to reduce the input token length while preserving the most relevant
information. To further reduce the input token length for the composition
prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that
only receives a summary of the most relevant endpoints and retrieves
specification details on demand. We evaluate RAG for endpoint discovery using
(iii) a proposed novel service discovery benchmark SOCBench-D representing a
general setting across numerous domains and the real-world RestBench benchmark,
first, for the different chunking possibilities and parameters measuring the
endpoint retrieval accuracy. Then, we assess the Discovery Agent using the same
test data set. The prototype shows how to successfully employ RAG for endpoint
discovery to reduce the token count. Our experiments show that endpoint-based
approaches outperform naive chunking methods for preprocessing. Relying on an
agent significantly improves precision while being prone to decrease recall,
disclosing the need for further reasoning capabilities.
Authors' comments: arXiv admin note: substantial text overlap with arXiv:2411.19804
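To make the contrast between endpoint-based and naive chunking concrete, the sketch below splits an OpenAPI document into one chunk per endpoint (path plus HTTP method) instead of cutting the raw text at fixed lengths. It is an illustration under stated assumptions, not the paper's preprocessing pipeline: the function name `endpoint_chunks`, the choice of fields placed in each chunk, and the toy specification are all hypothetical.

```python
# HTTP method keys that an OpenAPI Path Item Object may contain.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def endpoint_chunks(spec):
    """Return one text chunk per (path, method) pair in an OpenAPI dict."""
    chunks = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters" or "summary"
            parts = (
                method.upper(),
                path,
                operation.get("summary", ""),
                operation.get("description", ""),
            )
            chunks.append(
                {
                    "id": f"{method.upper()} {path}",
                    "text": " ".join(p for p in parts if p),
                }
            )
    return chunks


# Hypothetical toy specification with two endpoints on one path.
toy_spec = {
    "openapi": "3.0.0",
    "paths": {
        "/pets": {
            "parameters": [],
            "get": {"summary": "List pets"},
            "post": {"summary": "Create a pet", "description": "Adds a new pet."},
        }
    },
}

chunks = endpoint_chunks(toy_spec)
```

Each chunk could then be embedded and indexed on its own, so that a retrieved hit always corresponds to exactly one endpoint rather than to an arbitrary slice of the specification.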
Yaoyang Liu, Junlin Li, Yinjun Wu, Zhen Chen
Although Multi-Vector Retrieval (MVR) has achieved the state of the art on
many information retrieval (IR) tasks, its performance depends heavily on how
queries are decomposed into smaller pieces, say phrases or tokens. However,
optimizing query decomposition for MVR performance is not end-to-end
differentiable. Even worse, jointly solving this problem and training the
downstream retrieval-based systems, say RAG systems, could be highly
inefficient. To overcome these challenges, we propose Performance-Oriented
Query Decomposer (POQD), a novel query decomposition framework for MVR. POQD
leverages one LLM for query decomposition and searches for the optimal prompt with
an LLM-based optimizer. We further propose an end-to-end training algorithm to
alternately optimize the prompt for query decomposition and the downstream
models. This algorithm can achieve superior MVR performance at a reasonable
training cost as our theoretical analysis suggests. POQD can be integrated
seamlessly into arbitrary retrieval-based systems such as Retrieval-Augmented
Generation (RAG) systems. Extensive empirical studies on representative
RAG-based QA tasks show that POQD outperforms existing query decomposition
strategies in both retrieval performance and end-to-end QA accuracy. POQD is
available at https://github.com/PKU-SDS-lab/POQD-ICML25.
Authors' comments: Published in ICML 2025
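For readers unfamiliar with multi-vector retrieval, the sketch below shows a common way such systems score a document: each query piece (a phrase or token embedding) is matched to its best document vector and the maxima are summed, as in ColBERT-style late interaction. Whether POQD uses exactly this scoring function is an assumption here; the function name `maxsim_score` and the toy vectors are illustrative only.

```python
def _dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query pieces, of the best-matching document vector similarity."""
    return sum(max(_dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical toy embeddings: a query decomposed into two pieces.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_full = [[1.0, 0.0], [0.0, 1.0]]  # covers both query pieces
doc_partial = [[1.0, 0.0]]           # covers only the first piece

score_full = maxsim_score(query, doc_full)
score_partial = maxsim_score(query, doc_partial)
```

Because the score is a sum over query pieces, the way a query is decomposed directly changes which documents rank highest, which is why decomposition quality matters for MVR performance.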
Abhijit Chakraborty, Chahana Dahal, Vivek Gupta
Federated Retrieval-Augmented Generation (Federated RAG) combines Federated Learning (FL), which enables distributed model training without exposing raw data, with Retrieval-Augmented Generation (RAG), which improves the factual accuracy of language models by grounding outputs in external knowledge. As large language models are increasingly deployed in privacy-sensitive domains such as healthcare, finance, and personalized assistance, Federated RAG offers a promising framework for secure, knowledge-intensive natural language processing (NLP). To the best of our knowledge, this paper presents the first systematic mapping study of Federated RAG, covering literature published between 2020 and 2025. Following Kitchenham's guidelines for evidence-based software engineering, we develop a structured classification of research focuses, contribution types, and application domains. We analyze architectural patterns, temporal trends, and key challenges, including privacy-preserving retrieval, cross-client heterogeneity, and evaluation limitations. Our findings synthesize a rapidly evolving body of research, identify recurring design patterns, and surface open questions, providing a foundation for future work at the intersection of RAG and federated systems.
Yongjie Wang, Jonathan Leung, Zhiqi Shen
Large Language Models (LLMs) have shown promise in character imitation,
enabling immersive and engaging conversations. However, they often generate
content that is irrelevant or inconsistent with a character's background. We
attribute these failures to: (1) the inability to accurately recall
character-specific knowledge due to entity ambiguity, and (2) a lack of
awareness of the character's cognitive boundaries. To address these issues, we
propose RoleRAG, a retrieval-based framework that integrates efficient entity
disambiguation for knowledge indexing with a boundary-aware retriever for
extracting contextually appropriate information from a structured knowledge
graph. Experiments on role-playing benchmarks show that RoleRAG's calibrated
retrieval helps both general-purpose and role-specific LLMs better align with
character knowledge and reduce hallucinated responses.
Authors' comments: A Retrieval-enhanced LLM Role-playing
Ainulla Khan, Yamada Moyuru, Srinidhi Akella
Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. While recent advancements have mainly focused on improving RAG for text-based queries, RAG on multi-modal documents containing both texts and images has not been fully explored, especially when fine-tuning does not work. This paper proposes BRIT, a novel multi-modal RAG framework that effectively unifies various text-image connections in the document into a multi-modal graph and retrieves the texts and images as a query-specific sub-graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieves not only directly query-relevant images and texts but also further relevant content for answering complex cross-modal multi-hop questions. To evaluate the effectiveness of BRIT, we introduce the MM-RAG test set, specifically designed for multi-modal question answering tasks that require understanding text-image relations. Our comprehensive experiments demonstrate the superiority of BRIT, highlighting its ability to handle cross-modal questions on multi-modal documents.