Niall McGuire, Yashar Moshfeghi
Information Retrieval (IR) systems primarily rely on users' ability to translate their internal information needs into (text) queries. However, this translation process is often uncertain and cognitively demanding, leading to queries that incompletely or inaccurately represent users' true needs. This challenge is particularly acute for users with ill-defined information needs or physical impairments that limit traditional text input, where the gap between cognitive intent and query expression becomes even more pronounced. Recent neuroscientific studies have explored Brain-Machine Interfaces (BMIs) as a potential solution, aiming to bridge the gap between users' cognitive semantics and their search intentions. However, current approaches attempting to decode explicit text queries from brain signals have shown limited effectiveness in learning robust brain-to-text representations, often failing to capture the nuanced semantic information present in brain patterns. To address these limitations, we propose BPR (Brain Passage Retrieval), a novel framework that eliminates the need for intermediate query translation by enabling direct retrieval of relevant passages from users' brain signals. Our approach leverages dense retrieval architectures to map EEG signals and text passages into a shared semantic space. Through comprehensive experiments on the ZuCo dataset, we demonstrate that BPR achieves up to 8.81% improvement in precision@5 over existing EEG-to-text baselines, while maintaining effectiveness across 30 participants. Our ablation studies reveal the critical role of hard negative sampling and specialised brain encoders in achieving robust cross-modal alignment. These results establish the viability of direct brain-to-passage retrieval and provide a foundation for developing more natural interfaces between users' cognitive states and IR systems.
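As a rough illustration of the retrieval and evaluation steps mentioned in this abstract, the sketch below ranks passages by cosine similarity to a query vector in a shared space and computes precision@k. It assumes the EEG and passage encoders have already produced the vectors; the function names and toy data are hypothetical and are not taken from the BPR implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs):
    """Return passage ids sorted by descending similarity to the query."""
    scores = {pid: cosine(query_vec, vec) for pid, vec in passage_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k ranked passages that are relevant."""
    return sum(1 for pid in ranked_ids[:k] if pid in relevant_ids) / k


# Toy data: in the paper's setting the query vector would come from an EEG encoder.
passages = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}
ranking = rank_passages([1.0, 0.1], passages)
score = precision_at_k(ranking, {"p1", "p2"}, k=2)
```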
Shanti Stewart, Gouthaman KV, Lie Lu, Andrea Fanelli
Content creators often use music to enhance their videos, from soundtracks in
movies to background music in video blogs and social media content. However,
identifying the best music for a video can be a difficult and time-consuming
task. To address this challenge, we propose a novel framework for automatically
retrieving a matching music clip for a given video, and vice versa. Our
approach leverages annotated music labels, as well as the inherent artistic
correspondence between visual and music elements. Distinct from previous
cross-modal music retrieval works, our method combines both self-supervised and
supervised training objectives. We use self-supervised and label-supervised
contrastive learning to train a joint embedding space between music and video.
We show the effectiveness of our approach by using music genre labels for the
supervised training component, and our framework can be generalized to other
music annotations (e.g., emotion, instrument, etc.). Furthermore, our method
enables fine-grained control over how much the retrieval process focuses on
self-supervised vs. label information at inference time. We evaluate the
learned embeddings through a variety of video-to-music and music-to-video
retrieval tasks. Our experiments show that the proposed approach successfully
combines self-supervised and supervised objectives and is effective for
controllable music-video retrieval.
Authors' comments: 4 pages + 1 reference page, 2 figures, 2 tables. Under review
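The abstract of the entry above does not say how the inference-time trade-off between self-supervised and label information is implemented. One simple possibility, shown here purely as an assumption, is a linear interpolation between two similarity scores with a weight `alpha`; all names below are hypothetical.

```python
def blended_score(sim_self, sim_label, alpha):
    """Interpolate between a self-supervised and a label-based similarity.

    alpha = 0 uses only the self-supervised score, alpha = 1 only the label score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * sim_self + alpha * sim_label


def retrieve(candidates, alpha):
    """Pick the candidate id with the highest blended score.

    candidates maps an id to a (sim_self, sim_label) pair.
    """
    return max(candidates, key=lambda cid: blended_score(*candidates[cid], alpha))


# Hypothetical similarities of two music clips to one query video.
clips = {"clip_a": (0.9, 0.1), "clip_b": (0.2, 0.8)}
```

With these toy numbers, `retrieve(clips, 0.0)` favours `clip_a` and `retrieve(clips, 1.0)` favours `clip_b`.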
Wenliang Zhong, Weizhi An, Feng Jiang, Hehuan Ma, Yuzhi Guo, Junzhou Huang
Composed Image Retrieval (CIR) involves retrieving a target image based on a
composed query of an image paired with text that specifies modifications or
changes to the visual reference. CIR is inherently an instruction-following
task, as the model needs to interpret and apply modifications to the image. In
practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot
CIR (ZS-CIR) is desirable. While existing ZS-CIR models based on CLIP have
shown promising results, their capability in interpreting and following
modification instructions remains limited. Some research attempts to address
this by incorporating Large Language Models (LLMs). However, these approaches
still face challenges in effectively integrating multimodal information and
instruction understanding. To tackle the above challenges, we propose a novel
embedding method utilizing an instruction-tuned Multimodal LLM (MLLM) to
generate composed representations, which significantly enhances the
instruction-following capability for a comprehensive integration between images and
instructions. Nevertheless, directly applying MLLMs introduces a new challenge
since MLLMs are primarily designed for text generation rather than embedding
extraction as required in CIR. To address this, we introduce a two-stage
training strategy to efficiently learn a joint multimodal embedding space and
further refine the ability to follow modification instructions by tuning the
model on a triplet dataset similar to the CIR format. Extensive experiments on
four public datasets (FashionIQ, CIRR, GeneCIS, and CIRCO) demonstrate the
superior performance of our model, outperforming state-of-the-art baselines by
a significant margin. Code is available at the GitHub repository.
Authors' comments: 9 pages, 8 figures
Tatsuki Koga, Ruihan Wu, Kamalika Chaudhuri
With the recent remarkable advancement of large language models (LLMs), there has been growing interest in utilizing them in domains with highly sensitive data that lies outside their training data. For this purpose, retrieval augmented generation (RAG) is particularly effective: it assists LLMs by directly providing relevant information from external knowledge sources. However, without extra privacy safeguards, RAG outputs risk leaking sensitive information from the external data source. In this work, we explore RAG under differential privacy (DP), a formal guarantee of data privacy. The main challenge with differentially private RAG is how to generate long, accurate answers within a moderate privacy budget. We address this by proposing an algorithm that spends privacy budget only on the tokens that require the sensitive information and uses the non-private LLM for other tokens. Our extensive empirical evaluations reveal that our algorithm outperforms the non-RAG baseline under a reasonable privacy budget of $\epsilon\approx 10$ across different models and datasets.
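The budget idea can be illustrated with a small accounting sketch. It assumes basic sequential composition (per-token epsilons simply add up) and a fixed, hypothetical cost per private token. It also ignores the privacy cost of deciding which tokens need the sensitive data, which a real mechanism has to account for. It is not the algorithm from the paper.

```python
def tokens_within_budget(needs_private, eps_per_private_token, total_epsilon):
    """Count how many tokens can be emitted before the budget would be exceeded.

    needs_private is a sequence of booleans, one per token: True if the token
    is generated with access to the sensitive documents. Tokens produced by
    the non-private model are treated as free.
    """
    spent = 0.0
    emitted = 0
    for is_private in needs_private:
        cost = eps_per_private_token if is_private else 0.0
        if spent + cost > total_epsilon:
            break  # the next private token would exceed the budget
        spent += cost
        emitted += 1
    return emitted, spent


# Hypothetical run: three of five tokens need the private data.
result = tokens_within_budget([False, True, False, True, True], 4.0, 10.0)
```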
Hermann Kroll, Pascal Sackhoff, Timo Breuer, Ralf Schenkel, Wolf-Tilo Balke
Keyword-based searches are today's standard in digital libraries. Yet,
complex retrieval scenarios, such as those in scientific knowledge bases, need more
sophisticated access paths. Although each document somewhat contributes to a
domain's body of knowledge, the exact structure between keywords, i.e., their
possible relationships, and the contexts spanned within each single document
will be crucial for effective retrieval. Following this logic, individual
documents can be seen as small-scale knowledge graphs on which graph queries
can provide focused document retrieval. We implemented a full-fledged
graph-based discovery system for the biomedical domain and demonstrated its
benefits in the past. Unfortunately, graph-based retrieval methods generally
follow an 'exact match' paradigm, which severely hampers search efficiency,
since exact match results are hard to rank by relevance. This paper extends our
existing discovery system and contributes effective graph-based unsupervised
ranking methods, a new query relaxation paradigm, and ontological rewriting.
These extensions improve the system further so that users can retrieve results
with higher precision and higher recall due to partial matching and ontological
rewriting.
Authors' comments: Technical Report of our accepted paper at AI4LAC@JCDL2024. 11 pages,
5 figures
Vlad C. Andrei, Alexandru P. Drăguţoiu, Gabriel Béna, Mahmoud Akl, Yin Li, Matthias Lohrmann, Ullrich J. Mönich, Holger Boche
This paper explores the potential of conversion-based neuromorphic algorithms
for highly accurate and energy-efficient single-snapshot multidimensional
harmonic retrieval (MHR). By casting the MHR problem as a sparse recovery
problem, we devise the currently proposed, deep-unrolling-based Structured
Learned Iterative Shrinkage and Thresholding (S-LISTA) algorithm to solve it
efficiently using complex-valued convolutional neural networks with
complex-valued activations, which are trained using a supervised regression
objective. Afterward, a novel method for converting the complex-valued
convolutional layers and activations into spiking neural networks (SNNs) is
developed. At the heart of this method lies the recently proposed Few Spikes
(FS) conversion, which is extended by modifying the neuron model's parameters
and internal dynamics to account for the inherent coupling between real and
imaginary parts in complex-valued computations. Finally, the converted SNNs are
mapped onto the SpiNNaker2 neuromorphic board, and a comparison in terms of
estimation accuracy and power efficiency between the original CNNs deployed on
an NVIDIA Jetson Xavier and the SNNs is conducted. The measurement
results show that the converted SNNs achieve almost five-fold power efficiency
at moderate performance loss compared to the original CNNs.
Authors' comments: accepted to the 58th Asilomar Conference on Signals, Systems, and
Computers, Oct. 27th - Oct. 30th, 2024, Pacific Grove, CA
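For readers unfamiliar with the shrinkage step that iterative shrinkage and thresholding algorithms are named after, here is a minimal complex-valued soft-thresholding function: it shrinks the magnitude of a complex number by a threshold and keeps its phase. This is the textbook operator only. The learned, structured layers of S-LISTA and the spiking conversion described above are not reproduced.

```python
def complex_soft_threshold(z, theta):
    """Shrink the magnitude of z by theta while preserving its phase."""
    magnitude = abs(z)
    if magnitude <= theta:
        return 0j  # small coefficients are set exactly to zero (sparsity)
    return z * (1.0 - theta / magnitude)


shrunk = complex_soft_threshold(3 + 4j, 1.0)   # magnitude 5 becomes 4
zeroed = complex_soft_threshold(0.3 + 0.4j, 1.0)
```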
Nikos Efthymiadis, Bill Psomas, Zakaria Laskar, Konstantinos Karantzalos, Yannis Avrithis, Ondřej Chum, Giorgos Tolias
This work addresses composed image retrieval in the context of domain
conversion, where the content of a query image is retrieved in the domain
specified by the query text. We show that a strong vision-language model
provides sufficient descriptive power without additional training. The query
image is mapped to the text input space using textual inversion. Unlike common
practice, which inverts in the continuous space of text tokens, we use the
discrete word space via a nearest-neighbor search in a text vocabulary. With
this inversion, the image is softly mapped across the vocabulary and is made
more robust using retrieval-based augmentation. Database images are retrieved
by a weighted ensemble of text queries combining mapped words with the domain
text. Our method outperforms prior art by a large margin on standard and newly
introduced benchmarks. Code: https://github.com/NikosEfth/freedom
Authors' comments: WACV 2025
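The discrete inversion described in the entry above can be sketched as a nearest-neighbour search over a word vocabulary, followed by a weighted set of text queries. The sketch assumes unit-normalised embeddings, softmax weights with a hypothetical temperature, and an invented prompt template. None of these details are taken from the linked repository.

```python
import math


def soft_word_mapping(image_vec, vocab_vecs, k=2, temperature=0.1):
    """Map an image embedding to its k nearest vocabulary words with soft weights."""
    sims = {w: sum(a * b for a, b in zip(image_vec, v)) for w, v in vocab_vecs.items()}
    top = sorted(sims, key=sims.get, reverse=True)[:k]
    exps = {w: math.exp(sims[w] / temperature) for w in top}
    total = sum(exps.values())
    return {w: exps[w] / total for w in top}


def ensemble_queries(word_weights, domain):
    """Combine each mapped word with the target domain text (hypothetical template)."""
    return [(f"a {domain} of a {word}", weight) for word, weight in word_weights.items()]


vocab = {"dog": [1.0, 0.0], "cat": [0.8, 0.6], "car": [0.0, 1.0]}
weights = soft_word_mapping([1.0, 0.0], vocab)
queries = ensemble_queries(weights, "sketch")
```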
Quang Hoang Trung, Nguyen Van Hoang Phuc, Le Trung Hoang, Quang Huu Hieu, Vo Nguyen Le Duy
Text Retrieval (TR) involves finding and retrieving text-based content relevant to a user's query from a large repository, with applications in real-world scenarios such as legal document retrieval. While most existing studies focus on English, limited work addresses Japanese contexts. In this paper, we introduce a new dataset specifically designed for Japanese legal contexts and propose a novel two-phase pipeline tailored to this domain. In the first phase, the model learns a broad understanding of global contexts, enhancing its generalization and adaptability to diverse queries. In the second phase, the model is fine-tuned to address complex queries specific to legal scenarios. Extensive experiments are conducted to demonstrate the superior performance of our method, which outperforms existing baselines. Furthermore, our pipeline proves effective in English contexts, surpassing comparable baselines on the MS MARCO dataset. We have made our code publicly available on GitHub, and the model checkpoints are accessible via HuggingFace.
Jean Bertin
This article introduces an innovative Retrieval Augmented Generation approach to similarity search. The proposed method uses a generative model to capture nuanced semantic information and retrieve similarity scores based on advanced context understanding. The study focuses on the BIOSSES dataset containing 100 pairs of sentences extracted from the biomedical domain, and introduces similarity search correlation results that outperform those previously attained on this dataset. Through an in-depth analysis of the model sensitivity, the research identifies optimal conditions leading to the highest similarity search accuracy: the results reveal high Pearson correlation scores, reaching 0.905 at a temperature of 0.5 with 20 examples provided in the prompt. The findings underscore the potential of generative models for semantic information retrieval and point to a promising research direction for similarity search.
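The Pearson correlation reported here compares predicted similarity scores with human annotations. A minimal implementation of the coefficient is shown below; the example inputs are made up and are not BIOSSES data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical model scores versus gold scores for four sentence pairs.
r = pearson([0.5, 1.0, 3.5, 4.0], [1.0, 1.5, 3.0, 4.0])
```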
Hieu Tran, Zonghai Yao, Junda Wang, Yifan Zhang, Zhichao Yang, Hong Yu
This work introduces RARE (Retrieval-Augmented Reasoning Enhancement), a
versatile extension to the mutual reasoning framework (rStar), aimed at
enhancing reasoning accuracy and factual integrity across large language models
(LLMs) for complex, knowledge-intensive tasks such as commonsense and medical
reasoning. RARE incorporates two innovative actions within the Monte Carlo Tree
Search (MCTS) framework: A6, which generates search queries based on the
initial problem statement, performs information retrieval using those queries,
and augments reasoning with the retrieved data to formulate the final answer;
and A7, which leverages information retrieval specifically for generated
sub-questions and re-answers these sub-questions with the relevant contextual
information. Additionally, a Retrieval-Augmented Factuality Scorer is proposed
to replace the original discriminator, prioritizing reasoning paths that meet
high standards of factuality. Experimental results with LLaMA 3.1 show that
RARE enables open-source LLMs to achieve competitive performance with top
closed-source models like GPT-4 and GPT-4o. This research establishes RARE as a
scalable solution for improving LLMs in domains where logical coherence and
factual integrity are critical.
Authors' comments: 24 pages, 8 figures
Heejin Do, Sangwon Ryu, Jonghwi Kim, Gary Geunbae Lee
With the growing demand to fit fine-grained user intents, faceted query-by-example (QBE), which retrieves similar documents conditioned on specific facets, has gained recent attention. However, prior approaches mainly depend on document-level comparisons using basic indicators like citations due to the lack of facet-level relevance datasets; yet, this limits their use to citation-based domains and fails to capture the intricacies of facet constraints. In this paper, we propose a multi-facet blending (FaBle) augmentation method, which exploits modularity by decomposing and recomposing to explicitly synthesize facet-specific training sets. We automatically decompose documents into facet units and generate (ir)relevant pairs by leveraging LLMs' intrinsic distinguishing capabilities; then, dynamically recomposing the units leads to facet-wise relevance-informed document pairs. Our modularization eliminates the need for pre-defined facet knowledge or labels. Further, to demonstrate FaBle's efficacy in a new domain beyond citation-based scientific paper retrieval, we release a benchmark dataset for educational exam item QBE. FaBle augmentation on 1K documents remarkably assists training in obtaining facet conditional embeddings.
T. Y. S. S. Santosh, Hassan Sarwat, Matthias Grabmair
In this paper, we introduce QABISAR, a novel framework for statutory article
retrieval, to overcome the semantic mismatch problem that arises when each
query-article pair is modeled in isolation, which makes it hard to learn
representations that can effectively capture multi-faceted information. QABISAR
leverages bipartite interactions between queries and articles to capture diverse
aspects inherent in them. Further, we employ knowledge distillation to transfer
enriched query representations from the graph network into the query bi-encoder,
to capture the rich semantics present in the graph representations, despite the
absence of graph-based supervision for unseen queries during inference. Our
experiments on
a real-world expert-annotated dataset demonstrate its effectiveness.
Authors' comments: Accepted to COLING 2025
Son Pham Tien, Hieu Nguyen Doan, An Nguyen Dai, Sang Dinh Viet
In the field of legal information retrieval, effective embedding-based models are essential for accurate question-answering systems. However, the scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts. To address this issue, we propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages. This synthetic data is then used to pre-train retrieval models, specifically bi-encoder and ColBERT, which are further fine-tuned using contrastive loss with mined hard negatives. Our experiments demonstrate that these enhancements lead to a strong improvement in retrieval accuracy, validating the effectiveness of synthetic data and pre-training techniques in overcoming the limitations posed by the lack of large labeled datasets in the Vietnamese legal domain.
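To make the role of hard negatives concrete, the sketch below computes a standard InfoNCE-style contrastive loss for one query from its similarity to the positive passage and to a list of negatives. The exact loss, temperature, and mining procedure used in the paper are not given in the abstract, so the values here are assumptions.

```python
import math


def contrastive_loss(pos_score, neg_scores, temperature=0.05):
    """Negative log-probability of the positive among positive + negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


# A hard negative (score close to the positive) yields a larger loss,
# and therefore a stronger training signal, than an easy one.
loss_hard = contrastive_loss(0.9, [0.8])
loss_easy = contrastive_loss(0.9, [0.1])
```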
Alain Riou, Antonin Gagneré, Gaëtan Hadjeres, Stefan Lattner, Geoffroy Peeters
In this paper, we tackle the task of musical stem retrieval. Given a musical
mix, the task consists of retrieving a stem that would fit with it, i.e., that would
sound pleasant if played together. To do so, we introduce a new method based on
Joint-Embedding Predictive Architectures, where an encoder and a predictor are
jointly trained to produce latent representations of a context and predict
latent representations of a target. In particular, we design our predictor to
be conditioned on arbitrary instruments, enabling our model to perform
zero-shot stem retrieval. In addition, we discover that pretraining the encoder
using contrastive learning drastically improves the model's performance.
We validate the retrieval performance of our model using the MUSDB18 and
MoisesDB datasets. We show that it significantly outperforms previous baselines
on both datasets, showcasing its ability to support more or less precise (and
possibly unseen) conditioning. We also evaluate the learned embeddings on a
beat tracking task, demonstrating that they retain temporal structure and local
information.
Authors' comments: Accepted to the IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP 2025)
Robin D. Pesl, Jerin G. Mathew, Massimo Mecella, Marco Aiello
Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle. A traditional approach is a registry that provides the API documentation of the systems' endpoints. Large Language Models (LLMs) have been shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. Within this work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves details on demand. We evaluate RAG for endpoint discovery using the RestBench benchmark, first, for the different chunking possibilities and parameters measuring the endpoint retrieval recall, precision, and F1 score. Then, we assess the Discovery Agent using the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce the token count. While our results show high values for recall, precision, and F1, further research is necessary to retrieve all requisite endpoints. Our experiments show that for preprocessing, LLM-based and format-specific approaches outperform naïve chunking methods. Relying on an agent further enhances these results as the agent splits the tasks into multiple fine-grained subtasks, improving the overall RAG performance in terms of token count, precision, and F1 score.
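As an illustration of what a format-specific chunking strategy could look like, the sketch below splits an already parsed OpenAPI document into one chunk per endpoint (path plus HTTP method). The chunk layout and function name are hypothetical; the paper evaluates several strategies that are not reproduced here.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def chunk_by_endpoint(spec):
    """Return one small text chunk per operation in an OpenAPI 'paths' object."""
    chunks = []
    for path, path_item in spec.get("paths", {}).items():
        for method, operation in path_item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip non-operation keys such as 'parameters'
            summary = operation.get("summary", "")
            endpoint = f"{method.upper()} {path}"
            chunks.append({"endpoint": endpoint, "text": f"{endpoint}: {summary}".strip()})
    return chunks


# Tiny hypothetical specification, already loaded into a dict.
spec = {
    "paths": {
        "/pets": {
            "parameters": [],
            "get": {"summary": "List pets"},
            "post": {"summary": "Create a pet"},
        }
    }
}
chunks = chunk_by_endpoint(spec)
```

Each chunk can then be embedded and indexed separately, so that only the endpoints relevant to a task are placed in the prompt.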
Muhammad Huzaifa, Yova Kementchedjhieva
Text-to-image retrieval is a critical task for managing diverse visual content, but common benchmarks for the task rely on small, single-domain datasets that fail to capture real-world complexity. Pre-trained vision-language models tend to perform well with easy negatives but struggle with hard negatives--visually similar yet incorrect images--especially in open-domain scenarios. To address this, we introduce Episodic Few-Shot Adaptation (EFSA), a novel test-time framework that adapts pre-trained models dynamically to a query's domain by fine-tuning on top-k retrieved candidates and synthetic captions generated for them. EFSA improves performance across diverse domains while preserving generalization, as shown in evaluations on queries from eight highly distinct visual domains and an open-domain retrieval pool of over one million images. Our work highlights the potential of episodic few-shot adaptation to enhance robustness in the critical and understudied task of open-domain text-to-image retrieval.
Cagatay Isil, Figen S. Oktem
In the phase retrieval problem, the aim is the recovery of an unknown image
from intensity-only measurements such as Fourier intensity. Although there are
several solution approaches, solving this problem is challenging due to its
nonlinear and ill-posed nature. Recently, learning-based approaches have
emerged as powerful alternatives to the analytical methods for several inverse
problems. In the context of phase retrieval, a novel plug-and-play approach
that exploits a learning-based prior and efficient update steps has been
presented at the Computational Optical Sensing and Imaging topical meeting,
with demonstrated state-of-the-art performance. The key idea was to incorporate
a learning-based prior into Gerchberg-Saxton-type algorithms through
plug-and-play regularization. In this paper, we present the mathematical
development of the method including the derivation of its analytical update
steps based on half-quadratic splitting and comparatively evaluate its
performance through extensive simulations on a large test dataset. The results
show the effectiveness of the method in terms of image quality,
computational efficiency, and robustness to initialization and noise.
Authors' comments: 16 pages, 5 figures
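Gerchberg-Saxton-type algorithms, mentioned in the entry above, repeatedly enforce the measured magnitudes while keeping the current phase estimate. The sketch below shows only that magnitude projection on a list of complex coefficients. The Fourier transforms, the learned prior, and the half-quadratic-splitting updates derived in the paper are not reproduced.

```python
import cmath


def project_onto_magnitude(estimate, measured_magnitude):
    """Replace each coefficient's magnitude with the measurement, keeping its phase."""
    return [
        cmath.rect(m, cmath.phase(z))
        for z, m in zip(estimate, measured_magnitude)
    ]


# Hypothetical coefficients and intensity-derived magnitudes.
projected = project_onto_magnitude([3 + 4j, -2j], [10.0, 1.0])
```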
Tian Yu, Shaolei Zhang, Yang Feng
Iterative retrieval refers to the process in which the model continuously
queries the retriever during generation to enhance the relevance of the
retrieved knowledge, thereby improving the performance of Retrieval-Augmented
Generation (RAG). Existing work typically employs few-shot prompting or
manually constructed rules to implement iterative retrieval. This introduces
additional inference overhead and overlooks the remarkable reasoning
capabilities of Large Language Models (LLMs). In this paper, we introduce
Auto-RAG, an autonomous iterative retrieval model centered on the LLM's
powerful decision-making capabilities. Auto-RAG engages in multi-turn dialogues
with the retriever, systematically planning retrievals and refining queries to
acquire valuable knowledge. This process continues until sufficient external
information is gathered, at which point the results are presented to the user.
To this end, we develop a method for autonomously synthesizing reasoning-based
decision-making instructions in iterative retrieval and fine-tune the latest
open-source LLMs. The experimental results indicate that Auto-RAG is capable of
autonomous iterative interaction with the retriever, effectively leveraging the
remarkable reasoning and decision-making abilities of LLMs, which leads to
outstanding performance across six benchmarks. Further analysis reveals that
Auto-RAG can autonomously adjust the number of iterations based on the
difficulty of the questions and the utility of the retrieved knowledge, without
requiring any human intervention. Moreover, Auto-RAG expresses the iterative
retrieval process in natural language, enhancing interpretability while
providing users with a more intuitive experience. Code is available at
https://github.com/ictnlp/Auto-RAG.
Authors' comments: Code is available at https://github.com/ictnlp/Auto-RAG
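The control flow of iterative retrieval described in the entry above can be sketched as a loop in which a decision function either issues a refined query or returns an answer. In Auto-RAG that decision is made by a fine-tuned LLM; here `retrieve` and `decide` are hypothetical stand-in callables, and the loop is not taken from the linked repository.

```python
def iterative_retrieval(question, retrieve, decide, max_iterations=5):
    """Alternate between retrieval and a decision step until an answer is produced.

    retrieve(query) returns a list of documents.
    decide(question, evidence) returns ("answer", text) or ("query", new_query).
    """
    evidence = []
    query = question
    for _ in range(max_iterations):
        evidence.extend(retrieve(query))
        action, payload = decide(question, evidence)
        if action == "answer":
            return payload, evidence
        query = payload  # refined query for the next round
    return None, evidence  # iteration budget exhausted without an answer


# Toy stand-ins for the retriever and the LLM's decision step.
corpus = {
    "capital of France?": ["France is in Europe."],
    "France capital city": ["Paris is the capital of France."],
}


def toy_retrieve(query):
    return corpus.get(query, [])


def toy_decide(question, evidence):
    if any("Paris" in doc for doc in evidence):
        return ("answer", "Paris")
    return ("query", "France capital city")


answer, evidence = iterative_retrieval("capital of France?", toy_retrieve, toy_decide)
```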
Shengming Zhao, Yuheng Huang, Jiayang Song, Zhijie Wang, Chengcheng Wan, Lei Ma
Retrieval-Augmented Generation (RAG) is a pivotal technique for enhancing the capability of large language models (LLMs) and has demonstrated promising efficacy across a diverse spectrum of tasks. While LLM-driven RAG systems show superior performance, they face unique challenges in stability and reliability. Their complexity hinders developers' efforts to design, maintain, and optimize effective RAG systems. Therefore, it is crucial to understand how RAG's performance is impacted by its design. In this work, we conduct an early exploratory study toward a better understanding of the mechanism of RAG systems, covering three code datasets, three QA datasets, and two LLMs. We focus on four design factors: retrieval document type, retrieval recall, document selection, and prompt techniques. Our study uncovers how each factor impacts system correctness and confidence, providing valuable insights for developing an accurate and reliable RAG system. Based on these findings, we present nine actionable guidelines for detecting defects and optimizing the performance of RAG systems. We hope our early exploration can inspire further advancements in engineering, improving and maintaining LLM-driven intelligent software systems for greater efficiency and reliability.
Herbert Wright, Weiming Zhi, Matthew Johnson-Roberson, Tucker Hermans
Constructing 3D representations of object geometry is critical for many downstream robotics tasks, particularly tabletop manipulation problems. These representations must be built from potentially noisy partial observations. In this work, we focus on the problem of reconstructing a multi-object scene from a single RGBD image, generally from a fixed camera in the scene. Traditional scene representation methods generally cannot infer the geometry of unobserved regions of the objects from the image. Attempts have been made to leverage deep learning to train on a dataset of observed objects and representations, and then generalize to new observations. However, this can be brittle to noisy real-world observations and objects not contained in the dataset, and cannot reason about their confidence. We propose BRRP, a reconstruction method that leverages preexisting mesh datasets to build an informative prior during robust probabilistic reconstruction. In order to make our method more efficient, we introduce the concept of retrieval-augmented prior, where we retrieve relevant components of our prior distribution during inference. The prior is used to estimate the geometry of occluded portions of the in-scene objects. Our method produces a distribution over object shape that can be used for reconstruction or measuring uncertainty. We evaluate our method in both simulated scenes and in the real world. We demonstrate the robustness of our method against deep learning-only approaches while being more accurate than a method without an informative prior.