Aakash Mahalingam, Vinesh Kumar Gande, Aman Chadha, Vinija Jain, Divya Chaudhary
Retrieval-Augmented Generation (RAG) systems have become pivotal in
leveraging vast corpora to generate informed and contextually relevant
responses, notably reducing hallucinations in Large Language Models. Despite
significant advancements, these systems struggle to efficiently process and
retrieve information from large datasets while maintaining a comprehensive
understanding of the context. This paper introduces SKETCH, a novel methodology
that enhances the RAG retrieval process by integrating semantic text retrieval
with knowledge graphs, thereby merging structured and unstructured data for a
more holistic comprehension. SKETCH, demonstrates substantial improvements in
retrieval performance and maintains superior context integrity compared to
traditional methods. Evaluated across four diverse datasets: QuALITY, QASPER,
NarrativeQA, and Italian Cuisine-SKETCH consistently outperforms baseline
approaches on key RAGAS metrics such as answer_relevancy, faithfulness,
context_precision and context_recall. Notably, on the Italian Cuisine dataset,
SKETCH achieved an answer relevancy of 0.94 and a context precision of 0.99,
representing the highest performance across all evaluated metrics. These
results highlight SKETCH's capability in delivering more accurate and
contextually relevant responses, setting new benchmarks for future retrieval
systems.
Authors' comments: 16 pages, 8 figures, Workshop on Generative AI and Knowledge Graphs
(GenAIK) at The 31st International Conference on Computational Linguistics
(COLING 2025)
Zhang Siyue, Xue Yuxiang, Zhang Yiming, Wu Xiaobao, Luu Anh Tuan, Zhao Chen
Understanding temporal relations and answering time-sensitive questions is crucial yet a challenging task for question-answering systems powered by large language models (LLMs). Existing approaches either update the parametric knowledge of LLMs with new facts, which is resource-intensive and often impractical, or integrate LLMs with external knowledge retrieval (i.e., retrieval-augmented generation). However, off-the-shelf retrievers often struggle to identify relevant documents that require intensive temporal reasoning. To systematically study time-sensitive question answering, we introduce the TempRAGEval benchmark, which repurposes existing datasets by incorporating temporal perturbations and gold evidence labels. As anticipated, all existing retrieval methods struggle with these temporal reasoning-intensive questions. We further propose Modular Retrieval (MRAG), a trainless framework that includes three modules: (1) Question Processing that decomposes question into a main content and a temporal constraint; (2) Retrieval and Summarization that retrieves evidence and uses LLMs to summarize according to the main content; (3) Semantic-Temporal Hybrid Ranking that scores each evidence summarization based on both semantic and temporal relevance. On TempRAGEval, MRAG significantly outperforms baseline retrievers in retrieval performance, leading to further improvements in final answer accuracy.
Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F. Ferreira, Sorin Lerner et al.
Formal verification using proof assistants, such as Coq, enables the creation
of high-quality software. However, the verification process requires
significant expertise and manual effort to write proofs. Recent work has
explored automating proof synthesis using machine learning and large language
models (LLMs). This work has shown that identifying relevant premises, such as
lemmas and definitions, can aid synthesis. We present Rango, a fully automated
proof synthesis tool for Coq that automatically identifies relevant premises
and also similar proofs from the current project and uses them during
synthesis. Rango uses retrieval augmentation at every step of the proof to
automatically determine which proofs and premises to include in the context of
its fine-tuned LLM. In this way, Rango adapts to the project and to the
evolving state of the proof. We create a new dataset, CoqStoq, of 2,226
open-source Coq projects and 196,929 theorems from GitHub, which includes both
training data and a curated evaluation benchmark of well-maintained projects.
On this benchmark, Rango synthesizes proofs for 32.0% of the theorems, which is
29% more theorems than the prior state-of-the-art tool Tactician. Our
evaluation also shows that Rango adding relevant proofs to its context leads to
a 47% increase in the number of theorems proven.
Authors' comments: In Proceedings of the 47th International Conference on Software
Engineering (ICSE), Ottawa, ON, Canada, April 2025
Parker Addison, Minh-Tuan H. Nguyen, Tomislav Medan, Mohammad T. Manzari, Brendan McElrone, Laksh Lalwani, Aboli More, Smita Sharma et al.
Organizations seeking to utilize Large Language Models (LLMs) for knowledge querying and analysis often encounter challenges in maintaining an LLM fine-tuned on targeted, up-to-date information that keeps answers relevant and grounded. Retrieval Augmented Generation (RAG) has quickly become a feasible solution for organizations looking to overcome the challenges of maintaining proprietary models and to help reduce LLM hallucinations in their query responses. However, RAG comes with its own issues regarding scaling data pipelines across tiered-access and disparate data sources. In many scenarios, it is necessary to query beyond a single data silo to provide richer and more relevant context for an LLM. Analyzing data sources within and across organizational trust boundaries is often limited by complex data-sharing policies that prohibit centralized data storage, therefore, inhibit the fast and effective setup and scaling of RAG solutions. In this paper, we introduce Confidential Computing (CC) techniques as a solution for secure Federated Retrieval Augmented Generation (FedRAG). Our proposed Confidential FedRAG system (C-FedRAG) enables secure connection and scaling of a RAG workflows across a decentralized network of data providers by ensuring context confidentiality. We also demonstrate how to implement a C-FedRAG system using the NVIDIA FLARE SDK and assess its performance using the MedRAG toolkit and MIRAGE benchmarking dataset.
Mohammad Mahdi Abootorabi, Ehsaneddin Asgari
This study introduces CLASP (Contrastive Language-Speech Pretraining), a
multilingual, multimodal representation tailored for audio-text information
retrieval. CLASP leverages the synergy between spoken content and textual data.
During training, we utilize our newly introduced speech-text dataset, which
encompasses 15 diverse categories ranging from fiction to religion. CLASP's
audio component integrates audio spectrograms with a pre-trained
self-supervised speech model, while its language encoding counterpart employs a
sentence encoder pre-trained on over 100 languages. This unified lightweight
model bridges the gap between various modalities and languages, enhancing its
effectiveness in handling and retrieving multilingual and multimodal data. Our
evaluations across multiple languages demonstrate that CLASP establishes new
benchmarks in HITS@1, MRR, and meanR metrics, outperforming traditional
ASR-based retrieval approaches in specific scenarios.
Authors' comments: accepted at ECIR 2025
Wangyu Xue, Chen Qian, Jiayi Wu, Yang Zhou, Wentao Liu, Ju Ren, Siming Fan, Yaoxue Zhang
Existing works on human-centric video understanding typically focus on analyzing specific moment or entire videos. However, many applications require higher precision at the frame level. In this work, we propose a novel task, BestShot, which aims to locate highlight frames within human-centric videos via language queries. This task demands not only a deep semantic comprehension of human actions but also precise temporal localization. To support this task, we introduce the BestShot Benchmark. %The benchmark is meticulously constructed by combining human detection and tracking, potential frame selection based on human judgment, and detailed textual descriptions crafted by human input to ensure precision. The benchmark is meticulously constructed by combining human-annotated highlight frames, detailed textual descriptions and duration labeling. These descriptions encompass three critical elements: (1) Visual content; (2) Fine-grained action; and (3) Human Pose Description. Together, these elements provide the necessary precision to identify the exact highlight frames in videos. To tackle this problem, we have collected two distinct datasets: (i) ShotGPT4o Dataset, which is algorithmically generated by GPT-4o and (ii) Image-SMPLText Dataset, a dataset with large-scale and accurate per-frame pose description leveraging PoseScript and existing pose estimation datasets. Based on these datasets, we present a strong baseline model, ShotVL, fine-tuned from InternVL, specifically for BestShot. We highlight the impressive zero-shot capabilities of our model and offer comparative analyses with existing SOTA models. ShotVL demonstrates a significant 52% improvement over InternVL on the BestShot Benchmark and a notable 57% improvement on the THUMOS14 Benchmark, all while maintaining the SOTA performance in general image classification and retrieval.
Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, SeungYoon Han, Jong C. Park
We introduce EXIT, an extractive context compression framework that enhances
both the effectiveness and efficiency of retrieval-augmented generation (RAG)
in question answering (QA). Current RAG systems often struggle when retrieval
models fail to rank the most relevant documents, leading to the inclusion of
more context at the expense of latency and accuracy. While abstractive
compression methods can drastically reduce token counts, their token-by-token
generation process significantly increases end-to-end latency. Conversely,
existing extractive methods reduce latency but rely on independent,
non-adaptive sentence selection, failing to fully utilize contextual
information. EXIT addresses these limitations by classifying sentences from
retrieved documents - while preserving their contextual dependencies - enabling
parallelizable, context-aware extraction that adapts to query complexity and
retrieval quality. Our evaluations on both single-hop and multi-hop QA tasks
show that EXIT consistently surpasses existing compression methods and even
uncompressed baselines in QA accuracy, while also delivering substantial
reductions in inference time and token count. By improving both effectiveness
and efficiency, EXIT provides a promising direction for developing scalable,
high-quality QA solutions in RAG pipelines. Our code is available at
https://github.com/ThisIsHwang/EXIT
Authors' comments: Under Review
Zehua Xia, Yuyang Wu, Yiyun Xia, Cam-Tu Nguyen
Multi-hop question answering (QA) often requires sequential retrieval
(multi-hop retrieval), where each hop retrieves missing knowledge based on
information from previous hops. To facilitate more effective retrieval, we aim
to distill knowledge from a posterior retrieval, which has access to posterior
information like an answer, into a prior retrieval used during inference when
such information is unavailable. Unfortunately, current methods for knowledge
distillation in one-time retrieval are ineffective for multi-hop QA due to two
issues: 1) Posterior information is often defined as the response (i.e. the
answer), which may not clearly connect to the query without intermediate
retrieval; and 2) The large knowledge gap between prior and posterior
retrievals makes existing distillation methods unstable, even resulting in
performance loss. As such, we propose MoPo (Momentum Posterior Regularization)
with two key innovations: 1) Posterior information of one hop is defined as a
query-focus summary from the golden knowledge of the previous and current hops;
2) We develop an effective training strategy where the posterior retrieval is
updated along with the prior retrieval via momentum moving average method,
allowing smoother and effective distillation. Experiments on HotpotQA and
StrategyQA demonstrate that MoPo outperforms existing baselines in both
retrieval and downstream QA tasks.
Authors' comments: Accepted by COLING 2025
Jaeseok Yoo, Hojae Han, Youngwon Lee, Jaejin Kim, Seung-won Hwang
Code generation with large language models has shown significant promise,
especially when employing retrieval-augmented generation (RAG) with few-shot
examples. However, selecting effective examples that enhance generation quality
remains a challenging task, particularly when the target programming language
(PL) is underrepresented. In this study, we present two key findings: (1)
retrieving examples whose presented algorithmic plans can be referenced for
generating the desired behavior significantly improves generation accuracy, and
(2) converting code into pseudocode effectively captures such algorithmic
plans, enhancing retrieval quality even when the source and the target PLs are
different. Based on these findings, we propose Plan-as-query Example Retrieval
for few-shot prompting in Code generation (PERC), a novel framework that
utilizes algorithmic plans to identify and retrieve effective examples. We
validate the effectiveness of PERC through extensive experiments on the
CodeContests, HumanEval and MultiPL-E benchmarks: PERC consistently outperforms
the state-of-the-art RAG methods in code generation, both when the source and
target programming languages match or differ, highlighting its adaptability and
robustness in diverse coding environments.
Authors' comments: Accepted by COLING 2025 main conference
Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai "Helen" Li, Yiran Chen
We introduce Speech Information Retrieval (SIR), a new long-context task for
Speech Large Language Models (Speech LLMs), and present SPIRAL, a 1,012-sample
benchmark testing models' ability to extract critical details from
approximately 90-second spoken inputs. While current Speech LLMs excel at
short-form tasks, they struggle with the computational and representational
demands of longer audio sequences. To address this limitation, we propose
SpeechPrune, a training-free token pruning strategy that uses speech-text
similarity and approximated attention scores to efficiently discard irrelevant
tokens. In SPIRAL, SpeechPrune achieves accuracy improvements of 29% and up to
47% over the original model and the random pruning model at a pruning rate of
20%, respectively. SpeechPrune can maintain network performance even at a
pruning level of 80%. This approach highlights the potential of token-level
pruning for efficient and scalable long-form speech understanding.
Authors' comments: Accepted at IEEE ICME 2025. Project page:
https://speechprune.github.io/
Yang Yang, Wenjuan Xi, Luping Zhou, Jinhui Tang
Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Actually, the assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently influenced when facing imbalanced modalities, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constraints instance-level matching, the structure-aware distillation further regularizes the geometric consistency between learned matching representations and intra-modal representations through the developed relational matching. Extensive experiments on different datasets affirm the superior cross-modal retrieval performance of our approach, simultaneously enhancing single-modal retrieval capabilities compared to the baseline models.
Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He
Retrieval-augmented generation (RAG) has emerged as a popular approach to steering the output of a large language model (LLM) by incorporating retrieved contexts as inputs. However, existing work observed the generator bias, such that improving the retrieval results may negatively affect the outcome. In this work, we show such bias can be mitigated, from inference scaling, aggregating inference calls from the permuted order of retrieved contexts. The proposed Mixture-of-Intervention (MOI) explicitly models the debiased utility of each passage with multiple forward passes to construct a new ranking. We also show that MOI can leverage the retriever's prior knowledge to reduce the computational cost by minimizing the number of permutations considered and lowering the cost per LLM call. We showcase the effectiveness of MOI on diverse RAG tasks, improving ROUGE-L on MS MARCO and EM on HotpotQA benchmarks by ~7 points.
Federico Siciliano, Francesca Pezzuti, Nicola Tonellotto, Fabrizio Silvestri
In the era of dense retrieval, document indexing and retrieval is largely based on encoding models that transform text documents into embeddings. The efficiency of retrieval is directly proportional to the number of documents and the size of the embeddings. Recent studies have shown that it is possible to reduce embedding size without sacrificing - and in some cases improving - the retrieval effectiveness. However, the methods introduced by these studies are query-dependent, so they can't be applied offline and require additional computations during query processing, thus negatively impacting the retrieval efficiency. In this paper, we present a novel static pruning method for reducing the dimensionality of embeddings using Principal Components Analysis. This approach is query-independent and can be executed offline, leading to a significant boost in dense retrieval efficiency with a negligible impact on the system effectiveness. Our experiments show that our proposed method reduces the dimensionality of document representations by over 50% with up to a 5% reduction in NDCG@10, for different dense retrieval models.
Baisen Wang, Le Zhuo, Zhaokai Wang, Chenxi Bao, Wu Chengjing, Xuecheng Nie, Jiao Dai, Jizhong Han et al.
Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Despite their effectiveness in other modalities, their application in multimodal music generation faces challenges of data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge; a Dual-track Music Retrieval module that combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework to generate music based on the two bridges. We conduct experiments on video-to-music, image-to-music, text-to-music, and controllable music generation tasks, along with experiments on controllability. The results demonstrate that VMB significantly enhances music quality, modality, and customization alignment compared to previous methods. VMB sets a new standard for interpretable and expressive multimodal music generation with applications in various multimedia fields. Demos and code are available at https://github.com/wbs2788/VMB.
Yunna Lv, Long Tang, Dengpan Ye, Caiyun Xie, Jiacheng Deng, Yiheng He
Deep hash-based retrieval techniques are widely used in facial retrieval systems to improve the efficiency of facial matching. However, it also brings the risk of privacy leakage. Deep hash models are easily influenced by adversarial examples, which can be leveraged to prevent the malicious retrieval of private images. The existing adversarial example methods against deep hash models focus on universality and transferability, lacking the research on its robustness in online social networks (OSNs), which leads to their failure in anti-retrieval after post-processing. Therefore, we provide the first in-depth discussion on robustness adversarial perturbation in universal transferable anti-facial retrieval and propose Three-in-One Adversarial Perturbation (TOAP). Specifically, we firstly analyze the performance of deep hash models after post-processing and construct a local and global Compression Generator (CG) to simulate complex post-processing scenarios. Then, we explore the variation patterns of the model's objective under image post-processing and propose robust optimization objectives, cluster centers and data space centers, optimizing them using meta-learning. Finally, we iteratively optimize perturbation by alternately generating adversarial examples and fine-tuning the CG, balancing the performance of perturbation while enhancing CG's ability to mitigate them. Numerous experiments demonstrate that, in addition to its advantages in universality and transferability, TOAP significantly outperforms current state-of-the-art methods in multiple robustness metrics. It further improves universality and transferability by 5% to 28%, and achieves up to about 33% significant improvement in several simulated post-processing scenarios as well as mainstream OSNs, demonstrating that TOAP can effectively protect private images from malicious retrieval in real-world scenarios.
Pengyue Jia, Derong Xu, Xiaopeng Li, Zhaocheng Du, Xiangyang Li, Yichao Wang, Yuhao Wang, Qidong Liu et al.
The reranker and generator are two critical components in the
Retrieval-Augmented Generation (i.e., RAG) pipeline, responsible for ranking
relevant documents and generating responses. However, due to differences in
pre-training data and objectives, there is an inevitable gap between the
documents ranked as relevant by the reranker and those required by the
generator to support answering the query. To address this gap, we propose
RADIO, a novel and practical preference alignment framework with RAtionale
DIstillatiOn. Specifically, we first propose a rationale extraction method that
leverages the reasoning capabilities of Large Language Models (LLMs) to extract
the rationales necessary for answering the query. Subsequently, a
rationale-based alignment process is designed to rerank the documents based on
the extracted rationales, and fine-tune the reranker to align the preferences.
We conduct extensive experiments on two tasks across three datasets to
demonstrate the effectiveness of our approach compared to baseline methods. Our
code is released online to ease reproduction.
Authors' comments: Accepted to ACL 25 Findings
Ellen Yi-Ge, Jiechao Gao, Wei Han, Wei Zhu
Recently, large language models (LLMs) have demonstrated impressive capabilities in dealing with new tasks with the help of in-context learning (ICL). In the study of Large Vision-Language Models (LVLMs), when implementing ICL, researchers usually adopts the naive strategies like fixed demonstrations across different samples, or selecting demonstrations directly via a visual-language embedding model. These methods does not guarantee the configured demonstrations fit the need of the LVLMs. To address this issue, we now propose a novel framework, \underline{d}emonstration \underline{r}etriever for large m\underline{u}lti-modal \underline{m}odel (DRUM), which fine-tunes the visual-language embedding model to better meet the LVLM's needs. First, we discuss the retrieval strategies for a visual-language task, assuming an embedding model is given. And we propose to concate the image and text embeddings to enhance the retrieval performance. Second, we propose to re-rank the demonstrations retrieved by the embedding model via the LVLM's feedbacks, and calculate a list-wise ranking loss for training the embedding model. Third, we propose an iterative demonstration mining strategy to improve the training of the embedding model. Through extensive experiments on 3 types of visual-language tasks, 7 benchmark datasets, our DRUM framework is proven to be effective in boosting the LVLM's in-context learning performance via retrieving more proper demonstrations.
Quoc-Bao Nguyen-Le, Thanh-Huy Le-Nguyen
Current video retrieval systems, especially those used in competitions,
primarily focus on querying individual keyframes or images rather than encoding
an entire clip or video segment. However, queries often describe an action or
event over a series of frames, not a specific image. This results in
insufficient information when analyzing a single frame, leading to less
accurate query results. Moreover, extracting embeddings solely from images
(keyframes) does not provide enough information for models to encode
higher-level, more abstract insights inferred from the video. These models tend
to only describe the objects present in the frame, lacking a deeper
understanding. In this work, we propose a system that integrates the latest
methodologies, introducing a novel pipeline that extracts multimodal data, and
incorporate information from multiple frames within a video, enabling the model
to abstract higher-level information that captures latent meanings, focusing on
what can be inferred from the video clip, rather than just focusing on object
detection in one single image.
Authors' comments: 9 pages, 4 figures
Ehsan Lotfi, Nikolay Banar, Nerses Yuzbashyan, Walter Daelemans
Statutory article retrieval plays a crucial role in making legal information
more accessible to both laypeople and legal professionals. Multilingual
countries like Belgium present unique challenges for retrieval models due to
the need for handling legal issues in multiple languages. Building on the
Belgian Statutory Article Retrieval Dataset (BSARD) in French, we introduce the
bilingual version of this dataset, bBSARD. The dataset contains parallel
Belgian statutory articles in both French and Dutch, along with legal questions
from BSARD and their Dutch translation. Using bBSARD, we conduct extensive
benchmarking of retrieval models available for Dutch and French. Our
benchmarking setup includes lexical models, zero-shot dense models, and
fine-tuned small foundation models. Our experiments show that BM25 remains a
competitive baseline compared to many zero-shot dense models in both languages.
We also observe that while proprietary models outperform open alternatives in
the zero-shot setting, they can be matched or surpassed by fine-tuning small
language-specific models. Our dataset and evaluation code are publicly
available.
Authors' comments: To be presented at RegNLP-2025 (COLING)
Sibei Chen, Ju Fan, Bin Wu, Nan Tang, Chao Deng, Pengyi Wang, Ye Li, Jian Tan et al.
Database management system (DBMS) configuration debugging, e.g., diagnosing poorly configured DBMS knobs and generating troubleshooting recommendations, is crucial in optimizing DBMS performance. However, the configuration debugging process is tedious and, sometimes challenging, even for seasoned database administrators (DBAs) with sufficient experience in DBMS configurations and good understandings of the DBMS internals (e.g., MySQL or Oracle). To address this difficulty, we propose Andromeda, a framework that utilizes large language models (LLMs) to enable automatic DBMS configuration debugging. Andromeda serves as a natural surrogate of DBAs to answer a wide range of natural language (NL) questions on DBMS configuration issues, and to generate diagnostic suggestions to fix these issues. Nevertheless, directly prompting LLMs with these professional questions may result in overly generic and often unsatisfying answers. To this end, we propose a retrieval-augmented generation (RAG) strategy that effectively provides matched domain-specific contexts for the question from multiple sources. They come from related historical questions, troubleshooting manuals and DBMS telemetries, which significantly improve the performance of configuration debugging. To support the RAG strategy, we develop a document retrieval mechanism addressing heterogeneous documents and design an effective method for telemetry analysis. Extensive experiments on real-world DBMS configuration debugging datasets show that Andromeda significantly outperforms existing solutions.