Elia Peruzzo, Dejia Xu, Xingqian Xu, Humphrey Shi, Nicu Sebe
Video generation is experiencing rapid growth, driven by advances in
diffusion models and the development of better and larger datasets. However,
producing high-quality videos remains challenging due to the high-dimensional
data and the complexity of the task. Recent efforts have primarily focused on
enhancing visual quality and addressing temporal inconsistencies, such as
flickering. Despite progress in these areas, the generated videos often fall
short in terms of motion complexity and physical plausibility, with many
outputs either appearing static or exhibiting unrealistic motion. In this work,
we propose a framework to improve the realism of motion in generated videos,
exploring a complementary direction to much of the existing literature.
Specifically, we advocate for the incorporation of a retrieval mechanism during
the generation phase. The retrieved videos act as grounding signals, providing
the model with demonstrations of how the objects move. Our pipeline is designed
to apply to any text-to-video diffusion model, conditioning a pretrained model
on the retrieved samples with minimal fine-tuning. We demonstrate the
superiority of our approach through established metrics, recently proposed
benchmarks, and qualitative results, and we highlight additional applications
of the framework.
Authors' comments: Code available at: https://github.com/helia95/ragme
Da Li, Keping Bi, Jiafeng Guo, Xueqi Cheng
Table retrieval is essential for accessing information stored in structured tabular formats; however, it remains less explored than text retrieval. The content of a table primarily consists of phrases and words, which include a large number of entities, such as times, locations, persons, and organizations. Entities are well studied in the context of text retrieval, but there is a noticeable lack of research on their application in table retrieval. In this work, we explore how to leverage entities in tables to improve retrieval performance. First, we investigate the important role of entities in table retrieval from a statistical perspective and propose an entity-enhanced training framework. Subsequently, we use entity types to highlight entities instead of introducing an external knowledge base. Moreover, we design an interaction paradigm based on entity representations. Our proposed framework is plug-and-play and flexible, making it easy to integrate into existing table retriever training processes. Empirical results on two table retrieval benchmarks, NQ-TABLES and OTT-QA, show that our proposed framework is both simple and effective in enhancing existing retrievers. We also conduct extensive analyses to confirm the efficacy of different components. Overall, our work provides a promising direction for improving table retrieval and can inform future research in this area.
Khai Phan Tran, Xue Li
Document-level Relation Extraction (DocRE) involves identifying relations
between entities across multiple sentences in a document. Evidence sentences,
which are crucial for precisely identifying the relation of an entity pair,
focus the model on essential text segments and improve DocRE performance.
However, existing
evidence retrieval systems often overlook the collaborative nature among
semantically similar entity pairs in the same document, hindering the
effectiveness of the evidence retrieval task. To address this, we propose a
novel evidence retrieval framework, namely CDER. CDER employs an attentional
graph-based architecture to capture collaborative patterns and incorporates a
dynamic sub-structure for additional robustness in evidence retrieval.
Experimental results on the benchmark DocRE dataset show that CDER not only
excels in the evidence retrieval task but also enhances the overall performance
of existing DocRE systems.
Authors' comments: Published at ACIIDS 2024
Teng Shi, Jun Xu, Xiao Zhang, Xiaoxue Zang, Kai Zheng, Yang Song, Han Li
Recently, the personalization of Large Language Models (LLMs) to generate
content that aligns with individual user preferences has garnered widespread
attention. Personalized Retrieval-Augmented Generation (RAG), which retrieves
relevant documents from the user's history to reflect their preferences and
enhance LLM generation, is one commonly used approach for personalization.
However, existing personalized RAG methods do not consider that the histories
of similar users can also assist in personalized generation for the current
user, meaning that collaborative information between users can also benefit
personalized generation. Inspired by the application of collaborative filtering
in recommender systems, we propose a method called CFRAG, which adapts
Collaborative Filtering to RAG for personalized text generation. However, this
presents two challenges: (1) how to incorporate collaborative information
without explicit user similarity labels, and (2) how to retrieve documents that
support personalized LLM generation. For Challenge 1, we use contrastive
learning to train user embeddings to retrieve similar users and introduce
collaborative information. For Challenge 2, we design a personalized retriever
and reranker to retrieve the top-$k$ documents from these users' histories. We
take into account the user's preference during retrieval and reranking. Then we
leverage feedback from the LLM to fine-tune the personalized retriever and
reranker, enabling them to retrieve documents that meet the personalized
generation needs of the LLM. Experimental results on the Language Model
Personalization (LaMP) benchmark validate the effectiveness of CFRAG. Further
analysis confirms the importance of incorporating collaborative information.
Authors' comments: Accepted by SIGIR 2025
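The retrieval step described in the CFRAG abstract above can be illustrated with a minimal sketch. It assumes user embeddings are already trained (the contrastive training itself is not shown), uses plain cosine similarity in place of the personalized retriever and reranker, and pools the current user's own history together with the neighbours' histories. All function names and the toy data are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_similar_users(target, user_emb, m):
    """Return the m users whose embeddings are closest to the target user's."""
    scored = [(cosine(user_emb[target], emb), user)
              for user, emb in user_emb.items() if user != target]
    scored.sort(reverse=True)
    return [user for _, user in scored[:m]]


def retrieve_docs(query_emb, target, user_emb, histories, m=2, k=3):
    """Retrieve top-k documents from the target's and similar users' histories.

    Assumption: the pool includes the target user's own history; scoring is
    plain query-document cosine similarity, a stand-in for a learned retriever.
    """
    pool = []
    for user in [target] + top_similar_users(target, user_emb, m):
        for doc_id, doc_emb in histories.get(user, []):
            pool.append((cosine(query_emb, doc_emb), doc_id, user))
    pool.sort(reverse=True)
    return [(doc_id, user) for _, doc_id, user in pool[:k]]


# Toy data (hypothetical): 2-d embeddings for three users and their documents.
user_emb = {"alice": [1.0, 0.0], "bob": [0.9, 0.1], "carol": [0.0, 1.0]}
histories = {
    "alice": [("a1", [1.0, 0.0])],
    "bob": [("b1", [0.8, 0.2]), ("b2", [0.1, 0.9])],
    "carol": [("c1", [0.0, 1.0])],
}
print(retrieve_docs([1.0, 0.0], "alice", user_emb, histories, m=1, k=2))
```

In this toy setup the nearest neighbour of "alice" is "bob", so one of bob's documents can enter the retrieved set even though alice never wrote it, which is the collaborative signal the abstract refers to.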
Hajime Ueda, Shun Katakami, Masato Okada
We introduce a probabilistic approach to ptychographic reconstruction in
computational imaging. Ptychography is an imaging method where the complex
amplitude of an object is estimated from a sequence of diffraction
measurements. We formulate this reconstruction as a Bayesian inverse problem
and derive an inference algorithm, termed "Ptycho-EP," based on belief
propagation and Vector Approximate Message Passing from information theory.
Prior knowledge about the unknown object can be integrated into the
probabilistic model, and the Bayesian framework inherently provides uncertainty
quantification of the reconstruction. Numerical experiments demonstrate that,
when the probe's illumination function is known, our algorithm accurately
retrieves the object image at a sampling ratio approaching the
information-theoretic limit. In scenarios where the illumination function is
unknown, both
the object and the probe can be jointly reconstructed via an
Expectation-Maximization algorithm. We evaluate the performance of our
algorithm against conventional methods, highlighting its superior convergence
speed.
Authors' comments: 13 pages, 11 figures
Yucheng Chu, Peng He, Hang Li, Haoyu Han, Kaiqi Yang, Yu Xue, Tingting Li, Joseph Krajcik et al.
Short answer assessment is a vital component of science education, allowing evaluation of students' complex three-dimensional understanding. Large language models (LLMs) that possess human-like ability in linguistic tasks are increasingly popular in assisting human graders to reduce their workload. However, LLMs' limitations in domain knowledge restrict their understanding of task-specific requirements and hinder their ability to achieve satisfactory performance. Retrieval-augmented generation (RAG) emerges as a promising solution by enabling LLMs to access relevant domain-specific knowledge during assessment. In this work, we propose an adaptive RAG framework for automated grading that dynamically retrieves and incorporates domain-specific knowledge based on the question and student answer context. Our approach combines semantic search and curated educational sources to retrieve valuable reference materials. Experimental results on a science education dataset demonstrate that our system achieves an improvement in grading accuracy compared to baseline LLM approaches. The findings suggest that RAG-enhanced grading systems can serve as reliable support with efficient performance gains.
Kidist Amde Mekonnen, Yubao Tang, Maarten de Rijke
Generative information retrieval (GenIR) is a promising neural retrieval
paradigm that formulates document retrieval as a document identifier (docid)
generation task, allowing for end-to-end optimization toward a unified global
retrieval objective. However, existing GenIR models suffer from token-level
misalignment, where models trained to predict the next token often fail to
capture document-level relevance effectively. While reinforcement
learning-based methods, such as reinforcement learning from relevance feedback
(RLRF), aim to address this misalignment through reward modeling, they
introduce significant complexity, requiring the optimization of an auxiliary
reward function followed by reinforcement fine-tuning, which is computationally
expensive and often unstable. To address these challenges, we propose direct
document relevance optimization (DDRO), which aligns token-level docid
generation with document-level relevance estimation through direct optimization
via pairwise ranking, eliminating the need for explicit reward modeling and
reinforcement learning. Experimental results on benchmark datasets, including
MS MARCO document and Natural Questions, show that DDRO outperforms
reinforcement learning-based methods, achieving a 7.4% improvement in MRR@10
for MS MARCO and a 19.9% improvement for Natural Questions. These findings
highlight DDRO's potential to enhance retrieval effectiveness with a simplified
optimization approach. By framing alignment as a direct optimization problem,
DDRO simplifies the ranking optimization pipeline of GenIR models while
offering a viable alternative to reinforcement learning-based methods.
Authors' comments: 12 pages, 3 figures. SIGIR '25: Proceedings of the 48th
International ACM SIGIR Conference on Research and Development in Information
Retrieval, July 13-18, 2025, Padua, Italy. Code and pretrained models available
at: https://github.com/kidist-amde/ddro/
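To make the idea of "direct optimization via pairwise ranking" in the DDRO abstract above more concrete, here is a minimal sketch of one plausible pairwise objective over docid sequences. The abstract does not state the exact loss, so the DPO-style form below (log-likelihood ratios against a frozen reference model, scaled by a hypothetical `beta`) is an assumption, as are all function names.

```python
import math


def sequence_logprob(token_logprobs):
    """Log-probability of a docid, i.e. the sum of its token log-probabilities."""
    return sum(token_logprobs)


def pairwise_ranking_loss(pos_logp, neg_logp, ref_pos_logp, ref_neg_logp, beta=1.0):
    """Assumed DPO-style pairwise loss: -log(sigmoid(margin)).

    pos_* / neg_* are sequence log-probabilities of a relevant and a
    non-relevant docid under the trained model and a frozen reference model.
    """
    margin = beta * ((pos_logp - ref_pos_logp) - (neg_logp - ref_neg_logp))
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical token log-probabilities for one relevant and one non-relevant docid.
pos = sequence_logprob([-0.2, -0.1, -0.3])
neg = sequence_logprob([-0.9, -1.2, -0.8])
ref_pos = sequence_logprob([-0.5, -0.5, -0.5])
ref_neg = sequence_logprob([-0.5, -0.5, -0.5])
print(pairwise_ranking_loss(pos, neg, ref_pos, ref_neg))
```

The loss shrinks as the model assigns relatively more probability to the relevant docid than to the non-relevant one, without a separate reward model or a reinforcement learning loop.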
Leonardo Ranaldi, Federico Ranaldi, Fabio Massimo Zanzotto, Barry Haddow, Alexandra Birch
Retrieval-augmented generation (RAG) is key to enhancing large language models (LLMs) to systematically access richer factual knowledge. Yet, using RAG brings intrinsic challenges, as LLMs must deal with potentially conflicting knowledge, especially in multilingual retrieval, where the heterogeneity of knowledge retrieved may deliver different outlooks. To make RAG more analytical, critical and grounded, we introduce Dialectic-RAG (DRAG), a modular approach guided by Argumentative Explanations, i.e., a structured reasoning process that systematically evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives. Given a query and a set of multilingual related documents, DRAG selects and exemplifies relevant knowledge for delivering dialectic explanations that, by critically weighing opposing arguments and filtering extraneous content, clearly determine the final response. Through a series of in-depth experiments, we show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models. The final results demonstrate that DRAG significantly improves RAG approaches, requiring low-impact computational effort and providing robustness to knowledge perturbations.
Mohamed Eltahir, Osamah Sarraj, Mohammed Bremoo, Mohammed Khurd, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammad Almatrafi, Tanveer Hussain
Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. Additionally, the aural stream includes a complementary audio-based two-stage retrieval mechanism that enhances performance on long-duration videos. Considering the complex nature of retrieval from lengthy videos and its corresponding evaluation, we introduce a new retrieval evaluation method specifically designed for long-video retrieval to support further research. We conducted experiments on the YouCook2 benchmark, showing promising retrieval performance.
Peng Gao, Yujian Lee, Zailong Chen, Hui zhang, Xubo Liu, Yiyang Hu, Guquang Jing
Composed Image Retrieval (CIR) seeks to find a target image using a
multi-modal query, which combines an image with modification text to pinpoint
the target. While recent CIR methods have shown promise, they mainly focus on
exploring relationships between the query pairs (image and text) through data
augmentation or model design. These methods often assume perfect alignment
between queries and target images, an idealized scenario rarely encountered in
practice. In reality, pairs are often partially or completely mismatched due to
issues like inaccurate modification texts, low-quality target images, and
annotation errors. Ignoring these mismatches leads to numerous False Positive
Pairs (FPPs), denoted as noise pairs in the dataset, causing the model to overfit
and ultimately reducing its performance. To address this problem, we propose
the Noise-aware Contrastive Learning for CIR (NCL-CIR), comprising two key
components: the Weight Compensation Block (WCB) and the Noise-pair Filter Block
(NFB). The WCB coupled with diverse weight maps can ensure more stable token
representations of multi-modal queries and target images. Meanwhile, the NFB,
in conjunction with a Gaussian Mixture Model (GMM), predicts noise pairs by
evaluating loss distributions and generates soft labels accordingly,
allowing for the design of the soft-label based Noise Contrastive Estimation
(NCE) loss function. Consequently, the overall architecture helps to mitigate
the influence of mismatched and partially matched samples, with experimental
results demonstrating that NCL-CIR achieves exceptional performance on the
benchmark datasets.
Authors' comments: Accepted by ICASSP 2025
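The noise-pair filtering idea in the NCL-CIR abstract above (fitting a Gaussian Mixture Model to per-sample losses and turning the result into soft labels) can be sketched as follows. This is a generic two-component, one-dimensional EM fit written in plain Python. The initialization, the choice of the low-mean component as "clean", and the use of its posterior as the soft label are assumptions for illustration, not details confirmed by the abstract.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-d Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_component_gmm(losses, iters=50, min_var=1e-6):
    """Fit a 2-component 1-d GMM to per-sample losses with plain EM."""
    lo, hi = min(losses), max(losses)
    means = [lo, hi]  # assumed initialization: extremes of the loss range
    var0 = max(((hi - lo) / 2) ** 2, min_var)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each loss value.
        resp = []
        for x in losses:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(2)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue
            weights[j] = nj / len(losses)
            means[j] = sum(r[j] * x for r, x in zip(resp, losses)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, losses)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances


def clean_probabilities(losses):
    """Posterior of the low-loss component, usable as a soft 'clean pair' label."""
    weights, means, variances = fit_two_component_gmm(losses)
    clean = 0 if means[0] <= means[1] else 1
    probs = []
    for x in losses:
        p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(2)]
        total = sum(p) or 1e-300
        probs.append(p[clean] / total)
    return probs


# Hypothetical per-pair losses: four well-matched pairs and three noisy pairs.
losses = [0.10, 0.12, 0.09, 0.11, 2.0, 2.2, 1.9]
print([round(p, 3) for p in clean_probabilities(losses)])
```

Pairs with small losses receive a soft label close to 1 and high-loss pairs a label close to 0. How such labels enter the soft-label NCE loss is specific to the paper and is not reproduced here.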
Amin Haeri, Jonathan Vitrano, Mahdi Ghelichi
Risk management in finance involves recognizing, evaluating, and addressing
financial risks to maintain stability and ensure regulatory compliance.
Extracting relevant insights from extensive regulatory documents is a complex
challenge requiring advanced retrieval and language models. This paper
introduces RiskData, a dataset specifically curated for finetuning embedding
models in risk management, and RiskEmbed, a finetuned embedding model designed
to improve retrieval accuracy in financial question-answering systems. The
dataset is derived from 94 regulatory guidelines published by the Office of the
Superintendent of Financial Institutions (OSFI) from 1991 to 2024. We finetune
a state-of-the-art sentence BERT embedding model to enhance domain-specific
retrieval performance, typically in Retrieval-Augmented Generation (RAG)
systems. Experimental results demonstrate that RiskEmbed significantly
outperforms general-purpose and financial embedding models, achieving
substantial improvements in ranking metrics. By open-sourcing both the dataset
and the model, we provide a valuable resource for financial institutions and
researchers aiming to develop more accurate and efficient risk management AI
solutions.
Authors' comments: 10 pages, 3 figures, 2 tables, 1 equation
Leonardo Ranaldi, Barry Haddow, Alexandra Birch
Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP, enhancing large language models (LLMs) by allowing them to access richer factual contexts through in-context retrieval. While effective in monolingual settings, especially in English, its use in multilingual tasks remains unexplored. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering. We evaluate the performance of various multilingual RAG strategies, including question-translation (tRAG), which translates questions into English before retrieval, and Multilingual RAG (MultiRAG), where retrieval occurs directly across multiple languages. Our findings reveal that tRAG, while useful, suffers from limited coverage. In contrast, MultiRAG improves efficiency by enabling multilingual retrieval but introduces inconsistencies due to cross-lingual variations in the retrieved content. To address these issues, we propose Crosslingual RAG (CrossRAG), a method that translates retrieved documents into a common language (e.g., English) before generating the response. Our experiments show that CrossRAG significantly enhances performance on knowledge-intensive tasks, benefiting both high-resource and low-resource languages.
Peter Baile Chen, Tomer Wolfson, Michael Cafarella, Dan Roth
Existing information retrieval systems excel in cases where the language of
target documents closely matches that of the user query. However, real-world
retrieval systems are often required to implicitly reason whether a document is
relevant. For example, when retrieving technical texts or tables, their
relevance to the user query may be implied through a particular jargon or
structure, rather than explicitly expressed in their content. Large language
models (LLMs) hold great potential in identifying such implied relevance by
leveraging their reasoning skills. Nevertheless, current LLM-augmented
retrieval is hindered by high latency and computation cost, as the LLM
typically computes the query-document relevance online, for every query anew.
To tackle this issue, we introduce EnrichIndex, a retrieval approach that
instead uses the LLM offline to build semantically-enriched retrieval indices
by performing a single pass over all documents in the retrieval corpus at
ingestion time. Furthermore, the semantically-enriched indices can
complement existing online retrieval approaches, boosting the performance of
LLM re-rankers. We evaluated EnrichIndex on five retrieval tasks, involving
passages and tables, and found that it outperforms strong online LLM-based
retrieval systems, with an average improvement of 11.7 points in Recall@10
and 10.6 points in NDCG@10 compared to strong baselines. In terms of online
calls to the LLM, it processes 293.3 times fewer tokens, which greatly reduces
the online latency and cost. Overall, EnrichIndex is an effective way to build
better retrieval indices offline by leveraging the strong reasoning skills of
LLMs.
Authors' comments: Dataset and code are available at
https://peterbaile.github.io/enrichindex/
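For readers unfamiliar with the two metrics reported in the EnrichIndex abstract above, the following sketch shows how Recall@k and NDCG@k are commonly computed for a single query. It assumes binary relevance labels and a ranked list without duplicate document ids; the paper's own evaluation code may differ (for example, it reports values as points on a 0-100 scale). The function names are illustrative.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Hypothetical ranking for one query with two relevant documents.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=3), ndcg_at_k(ranked, relevant, k=3))
```

Recall@k only checks whether relevant documents are present in the top k, while NDCG@k also rewards placing them nearer the top.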
Prachi, Sumit Bhatia, Srikanta Bedathur
Multimodal representations are essential for cross-modal retrieval, but they often lack interpretability, making it difficult to understand the reasoning behind retrieved results. Sparse disentangled representations offer a promising solution; however, existing methods rely heavily on text tokens, resulting in high-dimensional embeddings. In this work, we propose a novel approach that generates compact, fixed-size embeddings that maintain disentanglement while providing greater control over retrieval tasks. We evaluate our method on challenging exclusion queries using the MSCOCO and Conceptual Captions benchmarks, demonstrating notable improvements over dense models like CLIP, BLIP, and VISTA (with gains of up to 11% in AP@10), as well as over sparse disentangled models like VDR (achieving up to 21% gains in AP@10). Furthermore, we present qualitative results that emphasize the enhanced interpretability of our disentangled representations.
Weitao Li, Kaiming Liu, Xiangyu Zhang, Xuanyu Lei, Weizhi Ma, Yang Liu
Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for knowledge integration during large language model (LLM) inference in recent years. However, current RAG implementations face challenges in effectively addressing noise, repetition and redundancy in retrieved content, primarily due to their limited ability to exploit fine-grained inter-document relationships. To address these limitations, we propose an Efficient Dynamic Clustering-based document Compression framework (EDC²-RAG) that effectively utilizes latent inter-document relationships while simultaneously removing irrelevant information and redundant content. We validate our approach, built upon GPT-3.5, on widely used knowledge-QA and hallucination-detection datasets. The results show that this method achieves consistent performance improvements across various scenarios and experimental settings, demonstrating strong robustness and applicability. Our code and datasets can be found at https://github.com/Tsinghua-dhy/EDC-2-RAG.
Alexandre Misrahi, Nadezhda Chirkova, Maxime Louis, Vassilina Nikoulina
Retrieval-Augmented Generation (RAG) enhances LLM factuality, but
multi-domain applications face challenges such as a lack of diverse benchmarks
and poor out-of-domain generalization. The first contribution of this work is to
introduce a diverse benchmark comprising a variety of question-answering tasks
from 8 sources and covering 13 domains. Our second contribution consists of
systematically testing out-of-domain generalization for typical RAG tuning
strategies. While our findings reveal that standard fine-tuning fails to
generalize effectively, we show that sequence-level distillation with
teacher-generated labels improves out-of-domain performance by providing more
coherent supervision. Our findings highlight key strategies for improving
multi-domain RAG robustness.
Authors' comments: 25 pages, 8 figures, 21 tables
Liangbo Ning, Wenqi Fan, Qing Li
Recently, Large Language Model (LLM)-empowered recommender systems have revolutionized personalized recommendation frameworks and attracted extensive attention. Despite this remarkable success, existing LLM-empowered RecSys have been demonstrated to be highly vulnerable to minor perturbations. To mitigate the negative impact of such vulnerabilities, one potential solution is to employ collaborative signals based on item-item co-occurrence to purify the malicious collaborative knowledge that attackers insert into the user's historical interactions. In addition, because they can supplement the insufficient internal knowledge of LLMs, Retrieval-Augmented Generation (RAG) techniques provide unprecedented opportunities to enhance the robustness of LLM-empowered recommender systems by introducing external collaborative knowledge. Therefore, in this paper, we propose a novel framework (RETURN) that retrieves external collaborative signals to purify poisoned user profiles and enhance the robustness of LLM-empowered RecSys in a plug-and-play manner. Specifically, retrieval-augmented perturbation positioning is proposed to identify potential perturbations within the users' historical sequences by retrieving external knowledge from collaborative item graphs. After that, we further retrieve the collaborative knowledge to cleanse the perturbations by using either deletion or replacement strategies and introduce a robust ensemble recommendation strategy to generate final robust predictions. Extensive experiments on three real-world datasets demonstrate the effectiveness of RETURN.
Zidong Yu, Shuo Wang, Nan Jiang, Weiqiang Huang, Xu Han, Junliang Du
Harmful text detection has become a crucial task in the development and deployment of large language models, especially as AI-generated content continues to expand across digital platforms. This study proposes a joint retrieval framework that integrates pre-trained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection. Experimental results demonstrate that the joint retrieval approach significantly outperforms single-model baselines, particularly in low-resource training scenarios and multilingual environments. The proposed method effectively captures nuanced harmful content by leveraging external contextual information, addressing the limitations of traditional detection models. Future research should focus on optimizing computational efficiency, enhancing model interpretability, and expanding multimodal detection capabilities to better tackle evolving harmful content patterns. This work contributes to the advancement of AI safety, ensuring more trustworthy and reliable content moderation systems.
Junlong Ren, Hao Wang
Cross-modal 3D retrieval is a critical yet challenging task, aiming to
achieve bi-directional retrieval between 3D and text modalities. Current
methods predominantly rely on a certain 3D representation (e.g., point cloud),
with few exploiting the 2D-3D consistency and complementary relationships,
which constrains their performance. To bridge this gap, we propose to adopt
multi-view images and point clouds to jointly represent 3D shapes, facilitating
tri-modal alignment (i.e., image, point, text) for enhanced cross-modal 3D
retrieval. Notably, we introduce tri-modal reconstruction to improve the
generalization ability of encoders. Given point features, we reconstruct image
features under the guidance of text features, and vice versa. With well-aligned
point cloud and multi-view image features, we aggregate them as multimodal
embeddings through fine-grained 2D-3D fusion to enhance geometric and semantic
understanding. Recognizing the significant noise in current datasets where many
3D shapes and texts share similar semantics, we employ hard negative
contrastive training to emphasize harder negatives with greater significance,
leading to robust discriminative embeddings. Extensive experiments on the
Text2Shape dataset demonstrate that our method outperforms previous
state-of-the-art methods in both shape-to-text and text-to-shape retrieval
tasks by a substantial margin.
Authors' comments: ICME 2025
Yuji Nozawa, Yu-Chieh Lin, Kazumoto Nakamura, Youyang Ng
The goal of this paper is to enhance pretrained Vision Transformer (ViT)
models for focus-oriented image retrieval with visual prompting. In real-world
image retrieval scenarios, both query and database images often exhibit
complexity, with multiple objects and intricate backgrounds. Users often want
to retrieve images containing a specific object, which we define as the Focus-Oriented
Image Retrieval (FOIR) task. While a standard image encoder can be employed to
extract image features for similarity matching, it may not perform optimally in
the multi-object-based FOIR task. This is because each image is represented by
a single global feature vector. To overcome this, a prompt-based image
retrieval solution is required. We propose an approach called Prompt-guided
attention Head Selection (PHS) to leverage the head-wise potential of the
multi-head attention mechanism in ViT in a promptable manner. PHS selects
specific attention heads by matching their attention maps with the user's visual
prompts, such as a point, box, or segmentation. This empowers the model to
focus on a specific object of interest while preserving the surrounding visual
context. Notably, PHS does not necessitate model re-training and avoids any
image alteration. Experimental results show that PHS substantially improves
performance on multiple datasets, offering a practical and training-free
solution to enhance model performance in the FOIR task.
Authors' comments: Accepted to CVPR 2025 PixFoundation Workshop
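The head-selection step described in the PHS abstract above can be illustrated with a minimal sketch. It assumes each attention head provides a flattened attention map over image patches and that the user's visual prompt is given as a binary patch mask. The matching score used here (share of a head's attention mass that falls inside the prompt region) is an assumption for illustration; the abstract does not specify the matching function, and the names are hypothetical.

```python
def head_prompt_scores(attn_maps, prompt_mask):
    """Score each head by the fraction of its attention inside the prompt mask."""
    scores = []
    for attn in attn_maps:
        total = sum(attn)
        inside = sum(a for a, m in zip(attn, prompt_mask) if m)
        scores.append(inside / total if total > 0 else 0.0)
    return scores


def select_heads(attn_maps, prompt_mask, num_heads):
    """Return the indices of the num_heads heads that best match the prompt."""
    scores = head_prompt_scores(attn_maps, prompt_mask)
    order = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(order[:num_heads])


# Hypothetical example: 3 heads attending over 4 patches; the prompt (e.g. a box)
# covers the last two patches.
attn_maps = [
    [0.7, 0.1, 0.1, 0.1],      # head 0 mostly attends outside the prompt
    [0.1, 0.1, 0.4, 0.4],      # head 1 mostly attends inside the prompt
    [0.25, 0.25, 0.25, 0.25],  # head 2 attends uniformly
]
prompt_mask = [0, 0, 1, 1]
print(select_heads(attn_maps, prompt_mask, num_heads=2))
```

Because selection only reads existing attention maps, a scheme of this kind needs neither re-training nor any change to the input image, which matches the training-free property claimed in the abstract. How the selected heads are then used to form the final image feature is not covered by this sketch.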