Danrui Li, Yichao Shi, Yaluo Wang, Ziying Shi, Mubbasir Kapadia
Efficiently searching for relevant case studies is critical in architectural
design, as designers rely on precedent examples to guide or inspire their
ongoing projects. However, traditional text-based search tools struggle to
capture the inherently visual and complex nature of architectural knowledge,
often leading to time-consuming and imprecise exploration. This paper
introduces ArchSeek, an innovative case study search system with recommendation
capability, tailored for architecture design professionals. Powered by the
visual understanding capabilities from vision-language models and cross-modal
embeddings, it enables text and image queries with fine-grained control, and
interaction-based design case recommendations. It offers architects a more
efficient, personalized way to discover design inspirations, with potential
applications across other visually driven design fields. The source code is
available at https://github.com/danruili/ArchSeek.
Authors' comments: 15 pages, 8 figures, 3 tables. Accepted by CAAD Futures 2025
Arun Reddy, Alexander Martin, Eugene Yang, Andrew Yates, Kate Sanders, Kenton Murray, Reno Kriz, Celso M. de Melo et al.
In this work, we tackle the problem of text-to-video retrieval (T2VR).
Inspired by the success of late interaction techniques in text-document,
text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a
simple and efficient mechanism for fine-grained similarity assessment between
queries and videos. Video-ColBERT is built upon 3 main components: a
fine-grained spatial and temporal token-wise interaction, query and visual
expansions, and a dual sigmoid loss during training. We find that this
interaction and training paradigm leads to strong individual, yet compatible,
representations for encoding video content. These representations lead to
increases in performance on common text-to-video retrieval benchmarks compared
to other bi-encoder methods.
Authors' comments: Accepted at CVPR 2025. 13 pages, 4 figures. Approved for public
release: distribution unlimited
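To illustrate the late-interaction idea the abstract builds on, here is a minimal sketch of generic ColBERT-style MaxSim scoring. It is not the authors' implementation: the function names, the toy 2-D embeddings, and the use of plain cosine similarity are assumptions, and the spatial/temporal interaction, expansion tokens, and dual sigmoid loss are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim_score(query_tokens, video_tokens):
    """Sum, over query tokens, of the similarity to the best-matching video token."""
    return sum(max(cosine(q, v) for v in video_tokens) for q in query_tokens)

# Toy token embeddings (hypothetical values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
video_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
video_b = [[1.0, 0.0], [-1.0, 0.0]]             # covers only the first

score_a = maxsim_score(query, video_a)
score_b = maxsim_score(query, video_b)
```

Because each query token is matched independently, a video that covers every query token (video_a) scores higher than one that covers only some (video_b).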
Hongru Cai, Yongqi Li, Ruifeng Yuan, Wenjie Wang, Zhen Zhang, Wenjie Li, Tat-Seng Chua
Generative retrieval reformulates retrieval as an autoregressive generation
task, where large language models (LLMs) generate target documents directly
from a query. As a novel paradigm, the mechanisms that underpin its performance
and scalability remain largely unexplored. We systematically investigate
training and inference scaling laws in generative retrieval, exploring how
model size, training data scale, and inference-time compute jointly influence
performance. We propose a novel evaluation metric inspired by contrastive
entropy and generation loss, providing a continuous performance signal that
enables robust comparisons across diverse generative retrieval methods. Our
experiments show that n-gram-based methods align strongly with training and
inference scaling laws. We find that increasing model size, training data
scale, and inference-time compute all contribute to improved performance,
highlighting the complementary roles of these factors in enhancing generative
retrieval. Across these settings, LLaMA models consistently outperform T5
models, suggesting a particular advantage for larger decoder-only models in
generative retrieval. Our findings underscore that model sizes, data
availability, and inference computation interact to unlock the full potential
of generative retrieval, offering new insights for designing and optimizing
future systems.
Authors' comments: Accepted to SIGIR 2025
V Venktesh, Mandeep Rathee, Avishek Anand
Complex question-answering (QA) systems face significant challenges in
retrieving and reasoning over information that addresses multi-faceted queries.
While large language models (LLMs) have advanced the reasoning capabilities of
these systems, the bounded-recall problem persists, where procuring all
relevant documents in first-stage retrieval remains a challenge. Missing
pertinent documents at this stage leads to performance degradation that cannot
be remedied in later stages, especially given the limited context windows of
LLMs which necessitate high recall at smaller retrieval depths. In this paper,
we introduce SUNAR, a novel approach that leverages LLMs to guide a
Neighborhood Aware Retrieval process. SUNAR iteratively explores a neighborhood
graph of documents, dynamically promoting or penalizing documents based on
uncertainty estimates from interim LLM-generated answer candidates. We validate
our approach through extensive experiments on two complex QA datasets. Our
results show that SUNAR significantly outperforms existing retrieve-and-reason
baselines, achieving up to a 31.84% improvement in performance over existing
state-of-the-art methods for complex QA.
Authors' comments: Accepted at NAACL 2025 Main Conference
Bokai Cao, Xueyuan Lin, Yiyan Qi, Chengjin Xu, Cehao Yang, Jian Guo
Market simulators aim to create high-quality synthetic financial data that mimics real-world market dynamics, which is crucial for model development and robust assessment. Despite continuous advancements in simulation methodologies, market fluctuations vary in scale and source, and existing frameworks often excel at only specific tasks. To address this challenge, we propose Financial Wind Tunnel (FWT), a retrieval-augmented market simulator designed to generate controllable, reasonable, and adaptable market dynamics for model testing. FWT offers a more comprehensive and systematic generative capability across different data frequencies. By leveraging a retrieval method to discover cross-sectional information as the augmented condition, our diffusion-based simulator seamlessly integrates both macro- and micro-level market patterns. Furthermore, our framework allows the simulation to be controlled with wide applicability, including causal generation through "what-if" prompts or unprecedented cross-market trend synthesis. Additionally, we develop an automated optimizer for downstream quantitative models, using stress testing of simulated scenarios via FWT to enhance returns while controlling risks. Experimental results demonstrate that our approach enables generalizable and reliable market simulation and significantly improves the performance and adaptability of downstream models, particularly in highly complex and volatile market conditions. Our code and data sample are available at https://anonymous.4open.science/r/fwt_-E852
Pranavi Kolouju, Eric Xing, Robert Pless, Nathan Jacobs, Abby Stylianou
Composed image retrieval (CIR) enables users to search images using a reference image combined with textual modifications. Recent advances in vision-language models have improved CIR, but dataset limitations remain a barrier. Existing datasets often rely on simplistic, ambiguous, or insufficient manual annotations, hindering fine-grained retrieval. We introduce good4cir, a structured pipeline leveraging vision-language models to generate high-quality synthetic annotations. Our method involves: (1) extracting fine-grained object descriptions from query images, (2) generating comparable descriptions for target images, and (3) synthesizing textual instructions capturing meaningful transformations between images. This reduces hallucination, enhances modification diversity, and ensures object-level consistency. Applying our method improves existing datasets and enables creating new datasets across diverse domains. Results demonstrate improved retrieval accuracy for CIR models trained on our pipeline-generated datasets. We release our dataset construction framework to support further research in CIR and multi-modal retrieval.
Felix Faltings, Wei Wei, Yujia Bao
Traditional retrieval methods rely on transforming user queries into vector representations and retrieving documents based on cosine similarity within an embedding space. While efficient and scalable, this approach often fails to handle complex queries involving logical constructs such as negations, conjunctions, and disjunctions. In this paper, we propose a novel inference-time logical reasoning framework that explicitly incorporates logical reasoning into the retrieval process. Our method extracts logical reasoning structures from natural language queries and then composes the individual cosine similarity scores to formulate the final document scores. This approach enables the retrieval process to handle complex logical reasoning without compromising computational efficiency. Our results on both synthetic and real-world benchmarks demonstrate that the proposed method consistently outperforms traditional retrieval methods across different models and datasets, significantly improving retrieval performance for complex queries.
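As a rough illustration of composing per-term similarity scores according to a query's logical structure, the sketch below uses fuzzy-logic operators (min for AND, max for OR, 1 - s for NOT). The abstract does not state the composition rules, so these operators, the tuple-based expression format, and the toy scores are all assumptions, not the authors' method.

```python
def evaluate(expr, sims):
    """Recursively score a logical expression for one document.

    expr is a term name (str) or a tuple such as ("not", e),
    ("and", e1, e2, ...), or ("or", e1, e2, ...).
    sims maps term names to similarity scores assumed to lie in [0, 1].
    """
    if isinstance(expr, str):
        return sims[expr]
    op, *args = expr
    scores = [evaluate(a, sims) for a in args]
    if op == "not":
        return 1.0 - scores[0]   # assumed negation rule
    if op == "and":
        return min(scores)       # assumed conjunction rule
    if op == "or":
        return max(scores)       # assumed disjunction rule
    raise ValueError(f"unknown operator: {op}")

# Hypothetical query: "about dogs but not about cats".
query = ("and", "dogs", ("not", "cats"))
doc_sims = {
    "doc1": {"dogs": 0.9, "cats": 0.8},
    "doc2": {"dogs": 0.7, "cats": 0.1},
}
ranking = sorted(doc_sims, key=lambda d: evaluate(query, doc_sims[d]), reverse=True)
```

Here doc1 is closest to "dogs" but is penalized for also matching "cats", so doc2 ranks first. Each term still needs only one cosine similarity per document, which is consistent with the efficiency claim in the abstract.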
Yicheng Duan, Xi Huang, Duo Chen
The rapid growth of video content demands efficient and precise retrieval systems. While vision-language models (VLMs) excel in representation learning, they often struggle with adaptive, time-sensitive video retrieval. This paper introduces a novel framework that combines vector similarity search with graph-based data structures. By leveraging VLM embeddings for initial retrieval and modeling contextual relationships among video segments, our approach enables adaptive query refinement and improves retrieval accuracy. Experiments demonstrate its precision, scalability, and robustness, offering an effective solution for interactive video retrieval in dynamic environments.
Alex-Razvan Ispas, Charles-Elie Simon, Fabien Caspani, Vincent Guigue
Large Language Models are prompting us to view more NLP tasks from a
generative perspective. At the same time, they offer a new way of accessing
information, mainly through the RAG framework. While there have been notable
improvements for the autoregressive models, overcoming hallucination in the
generated answers remains a continuous problem. A standard solution is to use
commercial LLMs, such as GPT4, to evaluate these algorithms. However, such
frameworks are expensive and not very transparent. Therefore, we propose a
study that demonstrates the value of open-weight models for evaluating RAG
hallucination. We develop a lightweight approach using smaller, quantized LLMs
to provide an accessible and interpretable metric that gives continuous scores
for the generated answer with respect to their correctness and faithfulness.
This score allows us to question decisions' reliability and explore thresholds
to develop a new AUC metric as an alternative to correlation with human
judgment.
Authors' comments: 17 pages, 5 figures, published at 1st workshop of Quantify
Uncertainty and Hallucination in Foundation Models: The Next Frontier in
Reliable AI at ICLR 25
Ghadir Alselwi, Hao Xue, Shoaib Jameel, Basem Suleiman, Flora D. Salim, Imran Razzak
Effective long-term memory management is crucial for language models handling extended contexts. We introduce a novel framework that dynamically ranks memory entries based on relevance. Unlike previous works, our model introduces a novel relevance scoring method and a pointwise re-ranking model for key-value embeddings, inspired by learning-to-rank techniques in information retrieval. Our framework, Enhanced Ranked Memory Augmented Retrieval (ERMAR), achieves state-of-the-art results on standard benchmarks.
Wenqi Jiang, Suvinay Subramanian, Cat Graves, Gustavo Alonso, Amir Yazdanbakhsh, Vidushi Dadu
Retrieval-augmented generation (RAG), which combines large language models (LLMs) with retrievals from external knowledge databases, is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. In this paper, we make three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. Our evaluation shows that RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.
Seyoung Song
We introduce a novel large language model (LLM)-driven agent framework, which iteratively refines queries and filters contextual evidence by leveraging dynamically evolving knowledge. A defining feature of the system is its decoupling of external sources from an internal knowledge cache that is progressively updated to guide both query generation and evidence selection. This design mitigates bias-reinforcement loops and enables dynamic, trackable search exploration paths, thereby optimizing the trade-off between exploring diverse information and maintaining accuracy through autonomous agent decision-making. Our approach is evaluated on a broad range of open-domain question answering benchmarks, including multi-step tasks that mirror real-world scenarios where integrating information from multiple sources is critical, especially given the vulnerabilities of LLMs that lack explicit reasoning or planning capabilities. The results show that the proposed system not only outperforms single-step baselines regardless of task difficulty but also, compared to conventional iterative retrieval methods, demonstrates pronounced advantages in complex tasks through precise evidence-based reasoning and enhanced efficiency. The proposed system supports both competitive and collaborative sharing of updated context, enabling multi-agent extension. The benefits of multi-agent configurations become especially prominent as task difficulty increases. The number of convergence steps scales with task difficulty, suggesting cost-effective scalability.
Pengcheng Zhou, Yinglun Feng, Zhongliang Yang
The widespread adoption of Retrieval-Augmented Generation (RAG) systems in real-world applications has heightened concerns about the confidentiality and integrity of their proprietary knowledge bases. These knowledge bases, which play a critical role in enhancing the generative capabilities of Large Language Models (LLMs), are increasingly vulnerable to breaches that could compromise sensitive information. To address these challenges, this paper proposes an advanced encryption methodology designed to protect RAG systems from unauthorized access and data leakage. Our approach encrypts both textual content and its corresponding embeddings prior to storage, ensuring that all data remains securely encrypted. This mechanism restricts access to authorized entities with the appropriate decryption keys, thereby significantly reducing the risk of unintended data exposure. Furthermore, we demonstrate that our encryption strategy preserves the performance and functionality of RAG pipelines, ensuring compatibility across diverse domains and applications. To validate the robustness of our method, we provide comprehensive security proofs that highlight its resilience against potential threats and vulnerabilities. These proofs also reveal limitations in existing approaches, which often lack robustness, adaptability, or reliance on open-source models. Our findings suggest that integrating advanced encryption techniques into the design and deployment of RAG systems can effectively enhance privacy safeguards. This research contributes to the ongoing discourse on improving security measures for AI-driven services and advocates for stricter data protection standards within RAG architectures.
Yixiong Fang, Tianran Sun, Yuling Shi, Xiaodong Gu
While RAG demonstrates remarkable capabilities in LLM applications, its effectiveness is hindered by the ever-increasing length of retrieved contexts, which introduces information redundancy and substantial computational overhead. Existing context pruning methods, such as LLMLingua, lack contextual awareness and offer limited flexibility in controlling compression rates, often resulting in either insufficient pruning or excessive information loss. In this paper, we propose AttentionRAG, an attention-guided context pruning method for RAG systems. The core idea of AttentionRAG lies in its attention focus mechanism, which reformulates RAG queries into a next-token prediction paradigm. This mechanism isolates the query's semantic focus to a single token, enabling precise and efficient attention calculation between queries and retrieved contexts. Extensive experiments on LongBench and Babilong benchmarks show that AttentionRAG achieves up to 6.3x context compression while outperforming LLMLingua methods by around 10% in key metrics.
Huaying Yuan, Zheng Liu, Minhao Qin, Hongjin Qian, Y Shu, Zhicheng Dou, Ji-Rong Wen
Retrieval-augmented generation (RAG) shows strong potential in addressing long-video understanding (LVU) tasks. However, traditional RAG methods remain fundamentally limited due to their dependence on explicit search queries, which are unavailable in many situations. To overcome this challenge, we introduce a novel RAG-based LVU approach inspired by the cognitive memory of human beings, which is called MemVid. Our approach operates in four basic steps: memorizing holistic video information, reasoning about the task's information needs based on the memory, retrieving critical moments based on the information needs, and focusing on the retrieved moments to produce the final answer. To enhance the system's memory-grounded reasoning capabilities and achieve optimal end-to-end performance, we propose a curriculum learning strategy. This approach begins with supervised learning on well-annotated reasoning results, then progressively explores and reinforces more plausible reasoning outcomes through reinforcement learning. We perform extensive evaluations on popular LVU benchmarks, including MLVU, VideoMME and LVBench. In our experiments, MemVid significantly outperforms existing RAG-based methods and popular LVU models, which demonstrates the effectiveness of our approach. Our model and source code will be made publicly available upon acceptance.
Kevin Qinghong Lin, Mike Zheng Shou
Human daily activities can be concisely narrated as sequences of routine
events (e.g., turning off an alarm) in video streams, forming an event
vocabulary. Motivated by this, we introduce VLog, a novel video understanding
framework that defines video narrations as vocabulary, going beyond the typical
subword vocabularies in existing generative video-language models. Built on the
lightweight language model GPT-2, VLog features three key innovations: (i) A
generative retrieval model, marrying the language model's complex reasoning
capabilities with contrastive retrieval's flexible upgrading over narration
vocabulary. (ii) A hierarchical vocabulary derived from large-scale video
narrations using our narration pair encoding algorithm, enabling efficient
indexing of specific events (e.g., cutting a tomato) by identifying broader
scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand).
(iii) A vocabulary update strategy leveraging generative models to extend the
vocabulary for novel events encountered during inference. To validate our
approach, we introduce VidCap-Eval, a development set requiring concise
narrations with reasoning relationships (e.g., before and after). Experiments
on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog,
highlighting its ability to generate concise, contextually accurate, and
efficient narrations, offering a novel perspective on video understanding.
Codes are released at https://github.com/showlab/VLog.
Authors' comments: Accepted by CVPR 2025. Github: https://github.com/showlab/VLog
Haoyu Wang, Sunhao Dai, Haiyuan Zhao, Liang Pang, Xiao Zhang, Gang Wang, Zhenhua Dong, Jun Xu et al.
Previous studies have found that PLM-based retrieval models exhibit a
preference for LLM-generated content, assigning higher relevance scores to
these documents even when their semantic quality is comparable to human-written
ones. This phenomenon, known as source bias, threatens the sustainable
development of the information access ecosystem. However, the underlying causes
of source bias remain unexplored. In this paper, we explain the process of
information retrieval with a causal graph and discover that PLM-based
retrievers learn perplexity features for relevance estimation, causing source
bias by ranking the documents with low perplexity higher. Theoretical analysis
further reveals that the phenomenon stems from the positive correlation between
the gradients of the loss functions in the language modeling and retrieval
tasks. Based on the analysis, a causal-inspired inference-time debiasing method
is proposed, called Causal Diagnosis and Correction (CDC). CDC first diagnoses
the bias effect of the perplexity and then separates the bias effect from the
overall estimated relevance score. Experimental results across three domains
demonstrate the superior debiasing effectiveness of CDC, emphasizing the
validity of our proposed explanatory framework. Source codes are available at
https://github.com/WhyDwelledOnAi/Perplexity-Trap.
Authors' comments: ICLR 2025
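For readers unfamiliar with the perplexity feature the abstract refers to, the following minimal sketch computes perplexity from per-token log-probabilities using the standard definition (the exponential of the negative mean log-probability). The function name and the toy probabilities are assumptions; this shows only the quantity being discussed, not the CDC debiasing procedure.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-mean of natural-log token probabilities)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical documents: each token is assigned probability 0.5 or 0.1
# by some language model.
predictable_doc = [math.log(0.5)] * 4
surprising_doc = [math.log(0.1)] * 4

ppl_predictable = perplexity(predictable_doc)
ppl_surprising = perplexity(surprising_doc)
```

Under the abstract's account, a retriever that has learned perplexity as a relevance feature would tend to rank the lower-perplexity document higher, independent of its semantic quality.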
Phu-Vinh Nguyen, Minh-Nam Tran, Long Nguyen, Dien Dinh
With the rapid development of natural language processing, many language models have been developed for multiple tasks. One important task is information retrieval (IR), which requires models to retrieve relevant documents. Despite its importance in many real-life applications, especially in retrieval augmented generation (RAG) systems, this task lacks Vietnamese benchmarks. This situation causes difficulty in assessing and comparing many existing Vietnamese embedding language models on the task and slows down the advancement of Vietnamese natural language processing (NLP) research. In this work, we aim to provide the Vietnamese research community with a new benchmark for information retrieval, which mainly focuses on retrieval and reranking tasks. Furthermore, we also present a new objective function based on the InfoNCE loss function, which is used to train our Vietnamese embedding model. Our function aims to outperform the original InfoNCE loss on information retrieval tasks. Finally, we analyze the effect of temperature, a hyper-parameter in both objective functions, on the performance of text embedding models.
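As background for the temperature analysis, here is a minimal sketch of the standard InfoNCE loss for a single query, given precomputed similarity scores. It does not reproduce the new objective proposed in the paper, which the abstract does not specify; the function name, default temperature, and toy similarities are assumptions.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.05):
    """Standard InfoNCE loss for one query.

    pos_sim: similarity between the query and its positive document.
    neg_sims: similarities between the query and negative documents.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]  # equals -log softmax(positive)

# Same similarities, two temperatures (hypothetical values).
loss_sharp = info_nce(0.8, [0.3, 0.1], temperature=0.05)
loss_soft = info_nce(0.8, [0.3, 0.1], temperature=1.0)
```

A lower temperature amplifies the gap between the positive and the negatives, so for a query that is already ranked correctly the loss is much smaller at temperature 0.05 than at 1.0.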
Mingyue Cheng, Yucong Luo, Jie Ouyang, Qi Liu, Huijie Liu, Li Li, Shuo Yu, Bohou Zhang et al.
Retrieval-Augmented Generation (RAG) has gained significant attention in recent years for its potential to enhance natural language understanding and generation by combining large-scale retrieval systems with generative models. RAG leverages external knowledge sources, such as documents, databases, or structured data, to improve model performance and generate more accurate and contextually relevant outputs. This survey aims to provide a comprehensive overview of RAG by examining its fundamental components, including retrieval mechanisms, generation processes, and the integration between the two. We discuss the key characteristics of RAG, such as its ability to augment generative models with dynamic external knowledge, and the challenges associated with aligning retrieved information with generative objectives. We also present a taxonomy that categorizes RAG methods, ranging from basic retrieval-augmented approaches to more advanced models incorporating multi-modal data and reasoning capabilities. Additionally, we review the evaluation benchmarks and datasets commonly used to assess RAG systems, along with a detailed exploration of its applications in fields such as question answering, summarization, and information retrieval. Finally, we highlight emerging research directions and opportunities for improving RAG systems, such as enhanced retrieval efficiency, model interpretability, and domain-specific adaptations. This paper concludes by outlining the prospects for RAG in addressing real-world challenges and its potential to drive further advancements in natural language processing.
Hanze Li, Xiande Huang
Growing evidence suggests that layer attention mechanisms, which enhance
interaction among layers in deep neural networks, have significantly advanced
network architectures. However, existing layer attention methods suffer from
redundancy, as attention weights learned by adjacent layers often become highly
similar. This redundancy causes multiple layers to extract nearly identical
features, reducing the model's representational capacity and increasing
training time. To address this issue, we propose a novel approach to quantify
redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent
layers. Additionally, we introduce an Enhanced Beta Quantile Mapping (EBQM)
method that accurately identifies and skips redundant layers, thereby
maintaining model stability. Our proposed Efficient Layer Attention (ELA)
architecture improves both training efficiency and overall performance,
achieving a 30% reduction in training time while enhancing performance in
tasks such as image classification and object detection.
Authors' comments: 11 pages, 7 figures
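The redundancy measure described above can be sketched as follows: compute the KL divergence between the attention weight distributions of adjacent layers and treat a small divergence as a sign of redundancy. This is only an illustration under stated assumptions. The fixed threshold stands in for the paper's EBQM selection step, which the abstract does not detail, and the attention weights are made-up values.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical per-layer attention weights (each row sums to 1).
layer_attn = [
    [0.70, 0.20, 0.10],
    [0.69, 0.21, 0.10],  # nearly identical to the previous layer
    [0.20, 0.30, 0.50],
]

threshold = 0.01  # hypothetical cutoff; the paper uses EBQM instead

# Indices of layers whose attention is nearly identical to the layer before.
redundant = [
    i + 1
    for i in range(len(layer_attn) - 1)
    if kl_divergence(layer_attn[i], layer_attn[i + 1]) < threshold
]
```

In this toy example only layer 1 is flagged, since its attention weights barely differ from those of layer 0.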