Konstantin Schall, Kai Uwe Barthel, Nico Hezel, Klaus Jung
Contrastive Language-Image Pre-training (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these models often struggle to differentiate between visually distinct images that have similar captions, resulting in suboptimal performance for image-based similarity searches. This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios, while maintaining their effectiveness in text-based search tasks such as text-to-image retrieval and zero-shot classification. We propose and evaluate two novel methods aimed at refining the retrieval capabilities of CLIP without compromising the alignment between text and image embeddings. The first method involves a sequential fine-tuning process: initially optimizing the image encoder for more precise image retrieval and subsequently realigning the text encoder to these optimized image embeddings. The second approach integrates pseudo-captions during the retrieval-optimization phase to foster direct alignment within the embedding space. Through comprehensive experiments, we demonstrate that these methods enhance CLIP's performance on various benchmarks, including image retrieval, k-NN classification, and zero-shot text-based classification, while maintaining robustness in text-to-image retrieval. Our optimized models permit maintaining a single embedding per image, significantly simplifying the infrastructure needed for large-scale multi-modal similarity search systems.
Nicholas Monath, Will Grathwohl, Michael Boratko, Rob Fergus, Andrew McCallum, Manzil Zaheer
In dense retrieval, deep encoders provide embeddings for both inputs and
targets, and the softmax function is used to parameterize a distribution over a
large number of candidate targets (e.g., textual passages for information
retrieval). Significant challenges arise in training such encoders in the
increasingly prevalent scenario of (1) a large number of targets, (2) a
computationally expensive target encoder model, (3) cached target embeddings
that are out-of-date due to ongoing training of target encoder parameters. This
paper presents a simple and highly scalable response to these challenges by
training a small parametric corrector network that adjusts stale cached target
embeddings, enabling an accurate softmax approximation and thereby sampling of
up-to-date high scoring "hard negatives." We theoretically investigate the
generalization properties of our proposed target corrector, relating the
complexity of the network, staleness of cached representations, and the amount
of training data. We present experimental results on large benchmark dense
retrieval datasets as well as on QA with retrieval augmented language models.
Our approach matches state-of-the-art results even when no target embedding
updates are made during training beyond an initial cache from the unsupervised
pre-trained model, providing a 4-80x reduction in re-embedding computational
cost.
Authors' comments: ICML 2024
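The idea in the abstract above (correct stale cached target embeddings, then sample hard negatives from a softmax over the corrected scores) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the residual linear form of `correct`, the function names, and the toy data are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def correct(stale, weight):
    """Hypothetical corrector: residual linear map, stale + W @ stale."""
    return [s + sum(w * x for w, x in zip(row, stale))
            for s, row in zip(stale, weight)]


def sample_hard_negatives(query, cache, weight, positive_idx, k, rng):
    """Sample k negatives (with replacement) in proportion to corrected softmax scores."""
    corrected = [correct(e, weight) for e in cache]
    scores = [sum(q * c for q, c in zip(query, e)) for e in corrected]
    probs = softmax(scores)
    probs[positive_idx] = 0.0  # never sample the true target as a negative
    return rng.choices(range(len(cache)), weights=probs, k=k)


# Toy usage: three cached 2-d target embeddings and an identity-like corrector (W = 0).
rng = random.Random(0)
cache = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
W = [[0.0, 0.0], [0.0, 0.0]]
negatives = sample_hard_negatives([1.0, 0.0], cache, W, positive_idx=0, k=5, rng=rng)
```

In a real system the corrector would be trained jointly with the query encoder, and the cache would hold far more targets than can be re-embedded at every step; the sketch only shows where the correction sits relative to the softmax.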
Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng
Retrieved documents containing noise will hinder RAG from detecting answer
clues and make the inference process slow and expensive. Therefore, context
compression is necessary to enhance its accuracy and efficiency. Existing
context compression methods use extractive or generative models to retain the
most query-relevant sentences or apply the information bottleneck theory to
preserve sufficient information. However, these methods may face issues such as
over-compression or high computational costs. We observe that the retriever
often ranks relevant documents at the top, but the exact number of documents
needed to answer the query is uncertain due to the impact of query complexity
and retrieval quality: complex queries like multi-hop questions may require
retaining more documents than simpler queries, and a low-quality retrieval may
need to rely on more documents to generate accurate outputs. Therefore,
determining the minimum number of required documents (compression rate) is
still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost
extractive context compression method that adaptively determines the
compression rate based on both query complexity and retrieval quality.
Specifically, we first annotate the minimum top-k documents necessary for the
RAG system to answer the current query as the compression rate and then
construct triplets of the query, retrieved documents, and its compression rate.
Then, we use this triplet dataset to train a compression-rate predictor.
Experiments on three QA datasets and one conversational Multi-doc QA dataset
show that AdaComp significantly reduces inference costs while maintaining
performance nearly identical to uncompressed models, achieving a balance
between efficiency and performance.
Authors' comments: 8 pages, 5 figures, code available at
https://anonymous.4open.science/r/AdaComp-8C0C/
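The annotation step described in the abstract above (find the minimum top-k documents needed to answer a query, then build query/documents/rate triplets) can be sketched as below. This is an illustrative sketch only: `can_answer` is a hypothetical stand-in for running the RAG system and checking its answer, and both allowing k = 0 and dropping unanswerable queries are assumptions, not details taken from the paper.

```python
def min_top_k(query, ranked_docs, can_answer):
    """Smallest k such that the top-k documents suffice; None if no prefix does."""
    for k in range(len(ranked_docs) + 1):
        if can_answer(query, ranked_docs[:k]):
            return k
    return None


def build_triplets(examples, can_answer):
    """Turn (query, ranked_docs) pairs into (query, ranked_docs, k) training triplets."""
    triplets = []
    for query, ranked_docs in examples:
        k = min_top_k(query, ranked_docs, can_answer)
        if k is not None:  # assumption: unanswerable queries are skipped
            triplets.append((query, ranked_docs, k))
    return triplets


def toy_can_answer(query, docs):
    """Toy oracle: a query is a set of required facts, each doc is a set of facts."""
    covered = set().union(*docs) if docs else set()
    return query <= covered


docs = [{"a"}, {"x"}, {"b"}]
# A "multi-hop" query needing facts a and b requires all three ranked documents.
k_multi_hop = min_top_k(frozenset({"a", "b"}), docs, toy_can_answer)
```

A predictor trained on such triplets would then estimate k at inference time without calling the oracle.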
Yeonjun In, Sungchul Kim, Ryan A. Rossi, Md Mehrab Tanjim, Tong Yu, Ritwik Sinha, Chanyoung Park
The retrieval-augmented generation (RAG) framework addresses ambiguity in
user queries in QA systems by retrieving passages that cover all plausible
interpretations and generating comprehensive responses based on the passages.
However, our preliminary studies reveal that a single retrieval process often
suffers from low quality results, as the retrieved passages frequently fail to
capture all plausible interpretations. Although the iterative RAG approach has
been proposed to address this problem, it comes at the cost of significantly
reduced efficiency. To address these issues, we propose the
diversify-verify-adapt (DIVA) framework. DIVA first diversifies the retrieved
passages to encompass diverse interpretations. Subsequently, DIVA verifies the
quality of the passages and adapts the most suitable approach tailored to their
quality. This approach improves the QA system's accuracy and robustness by
handling the low-quality retrieval issue in ambiguous questions, while enhancing
efficiency.
Authors' comments: NAACL 2025 Main
Shuo Yu, Mingyue Cheng, Jiqian Yang, Jie Ouyang, Yucong Luo, Chenyi Lei, Qi Liu, Enhong Chen
Retrieval-augmented generation (RAG) is increasingly recognized as an
effective approach to mitigating the hallucination of large language models
(LLMs) through the integration of external knowledge. Despite numerous efforts,
most studies focus on a single type of external knowledge source. In contrast,
most real-world applications involve diverse knowledge from various sources, a
scenario that has been relatively underexplored. The main dilemma is the lack
of a suitable dataset incorporating multiple knowledge sources and
pre-exploration of the associated issues. To address these challenges, we
standardize a benchmark dataset that combines structured and unstructured
knowledge across diverse and complementary domains. Building upon the dataset,
we identify the limitations of existing methods under such conditions.
Therefore, we develop PruningRAG, a plug-and-play RAG framework that uses
multi-granularity pruning strategies to more effectively incorporate relevant
context and mitigate the negative impact of misleading information. Extensive
experimental results demonstrate superior performance of PruningRAG and our
insightful findings are also reported. Our dataset and code are publicly
available at https://github.com/USTCAGI/PruningRAG.
Authors' comments: 12 pages, 9 figures
Antoine Louis, Gijs van Dijck, Gerasimos Spanakis
Hybrid search has emerged as an effective strategy to offset the limitations
of different matching paradigms, especially in out-of-domain contexts where
notable improvements in retrieval quality have been observed. However, existing
research predominantly focuses on a limited set of retrieval methods, evaluated
in pairs on domain-general datasets exclusively in English. In this work, we
study the efficacy of hybrid search across a variety of prominent retrieval
models within the unexplored field of law in the French language, assessing
both zero-shot and in-domain scenarios. Our findings reveal that in a zero-shot
context, fusing different domain-general models consistently enhances
performance compared to using a standalone model, regardless of the fusion
method. Surprisingly, when models are trained in-domain, we find that fusion
generally diminishes performance relative to using the best single system,
unless fusing scores with carefully tuned weights. These novel insights, among
others, expand the applicability of prior findings across a new field and
language, and contribute to a deeper understanding of hybrid search in
non-English specialized domains.
Authors' comments: Under review
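To make the fusion methods mentioned in the abstract above concrete, here is a small sketch of two widely used options: weighted fusion of min-max normalized scores, and reciprocal rank fusion (RRF). It is not taken from the paper; the function names, the toy scores, and the RRF constant `k=60` (a common default) are assumptions for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} run to [0, 1]; a constant run maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def weighted_fusion(runs, weights):
    """Rank documents by a weighted sum of normalized scores from each run."""
    fused = {}
    for run, w in zip(runs, weights):
        for d, s in min_max(run).items():
            fused[d] = fused.get(d, 0.0) + w * s
    return sorted(fused, key=lambda d: (-fused[d], d))


def rrf(runs, k=60):
    """Reciprocal rank fusion: sum of 1 / (k + rank) over all runs."""
    fused = {}
    for run in runs:
        ranking = sorted(run, key=lambda d: (-run[d], d))
        for rank, d in enumerate(ranking, start=1):
            fused[d] = fused.get(d, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy runs: lexical (BM25-like) and dense scores live on different scales.
bm25 = {"a": 12.0, "b": 7.0, "c": 2.0}
dense = {"a": 0.2, "b": 0.9, "c": 0.8}
equal_weights = weighted_fusion([bm25, dense], [0.5, 0.5])
lexical_only = weighted_fusion([bm25, dense], [1.0, 0.0])
rank_based = rrf([bm25, dense])
```

The weights in `weighted_fusion` are the kind of parameter the abstract says must be tuned carefully in the in-domain setting, whereas RRF ignores score magnitudes and needs no per-system weights.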
Ishan Rajendrakumar Dave, Fabian Caba Heilbron, Mubarak Shah, Simon Jenni
Temporal video alignment aims to synchronize key events, such as object
interactions or action phase transitions, in two videos. Such methods could
benefit various video editing, processing, and understanding tasks. However,
existing approaches operate under the restrictive assumption that a suitable
video pair for alignment is given, significantly limiting their broader
applicability. To address this, we re-pose temporal alignment as a search
problem and introduce the task of Alignable Video Retrieval (AVR). Given a
query video, our approach can identify well-alignable videos from a large
collection of clips and temporally synchronize them to the query. To achieve
this, we make three key contributions: 1) we introduce DRAQ, a video
alignability indicator to identify and re-rank the best alignable video from a
set of candidates; 2) we propose an effective and generalizable frame-level
video feature design to improve the alignment performance of several
off-the-shelf feature representations, and 3) we propose a novel benchmark and
evaluation protocol for AVR using cycle-consistency metrics. Our experiments on
3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of
our approach in identifying alignable video pairs from diverse datasets.
Project Page: https://daveishan.github.io/avr-webpage/.
Authors' comments: ECCV 2024 Oral
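The abstract above evaluates alignment with cycle-consistency metrics. One simple, generic notion of frame-level cycle consistency is sketched below: map each frame of video A to its nearest frame in video B, map that frame back to A, and measure how far the round trip lands from the starting index. This is an assumption-laden illustration, not the paper's exact metric; the nearest-neighbour matching and the function names are hypothetical.

```python
def nearest(feature, candidates):
    """Index of the candidate closest to `feature` (squared Euclidean distance)."""
    return min(
        range(len(candidates)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(feature, candidates[j])),
    )


def cycle_error(video_a, video_b):
    """Mean |i - i'| over frames of A, where i maps to j in B and j maps back to i' in A."""
    errors = []
    for i, feat in enumerate(video_a):
        j = nearest(feat, video_b)
        i_back = nearest(video_b[j], video_a)
        errors.append(abs(i - i_back))
    return sum(errors) / len(errors)


# Toy per-frame features (1-d): a well-alignable pair and a poorly alignable one.
query = [[0.0], [1.0], [2.0]]
good_match = [[0.1], [1.1], [2.1]]
poor_match = [[5.0]]
good_error = cycle_error(query, good_match)
poor_error = cycle_error(query, poor_match)
```

A lower cycle error suggests the pair can be synchronized consistently, which is the kind of signal a retrieval system could use to re-rank candidates.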
Kim Jinwoo
Image retrieval is a crucial research topic in computer vision, with broad application prospects ranging from online product searches to security surveillance systems. In recent years, the accuracy and efficiency of image retrieval have significantly improved due to advancements in deep learning. However, existing methods still face numerous challenges, particularly in handling large-scale datasets, cross-domain retrieval, and image perturbations that can arise from real-world conditions such as variations in lighting, occlusion, and viewpoint. Data augmentation techniques and adversarial learning methods have been widely applied in the field of image retrieval to address these challenges. Data augmentation enhances the model's generalization ability and robustness by generating more diverse training samples, simulating real-world variations, and reducing overfitting. Meanwhile, adversarial attacks and defenses introduce perturbations during training to improve the model's robustness against potential attacks, ensuring reliability in practical applications. This review comprehensively summarizes the latest research advancements in image retrieval, with a particular focus on the roles of data augmentation and adversarial learning techniques in enhancing retrieval performance. Future directions and potential challenges are also discussed.
Jaeyeon Kim, Jaeyoon Jung, Minjeong Jeon, Sang Hoon Woo, Jinjoo Lee
In this technical report, we describe our submission to DCASE2024 Challenge
Task6 (Automated Audio Captioning) and Task8 (Language-based Audio Retrieval).
We develop our approach building upon the EnCLAP audio captioning framework and
optimizing it for Task6 of the challenge. Notably, we outline the changes in
the underlying components and the incorporation of the reranking process.
Additionally, we submit a supplementary retriever model, a byproduct of our
modified framework, to Task8. Our proposed systems achieve a FENSE score of 0.542
on Task6 and an mAP@10 score of 0.386 on Task8, significantly outperforming the
baseline models.
Authors' comments: DCASE2024 Challenge Technical Report. Ranked 2nd in Task 6 Automated
Audio Captioning
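For readers unfamiliar with the retrieval metric reported above, the following sketch computes mAP@10 under one common definition (average precision truncated at rank k, normalized by the smaller of k and the number of relevant items). The challenge's official scoring script may differ in details; the function names and toy data are assumptions.

```python
def average_precision_at_k(ranked, relevant, k=10):
    """Average precision over the top-k ranked items for one query."""
    hits = 0
    total = 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / i  # precision at the rank of each relevant hit
    denom = min(k, len(relevant))
    return total / denom if denom else 0.0


def mean_average_precision_at_k(rankings, relevants, k=10):
    """Mean of per-query AP@k; `rankings` and `relevants` are parallel lists."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)


# Two toy text queries, each with a single relevant audio clip.
rankings = [["x", "a", "y"], ["b", "z"]]
relevants = [{"a"}, {"b"}]
score = mean_average_precision_at_k(rankings, relevants)
```

With a single relevant item per query, AP@k reduces to the reciprocal rank of that item when it appears in the top k, and to zero otherwise.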
Dongil Yang, Suyeon Lee, Minjin Kim, Jungsoo Won, Namyoung Kim, Dongha Lee, Jinyoung Yeo
Engagement between instructors and students plays a crucial role in enhancing
students' academic performance. However, instructors often struggle to provide
timely and personalized support in large classes. To address this challenge, we
propose a novel Virtual Teaching Assistant (VTA) named YA-TA, designed to offer
responses to students that are grounded in lectures and are easy to understand.
To facilitate YA-TA, we introduce the Dual Retrieval-augmented Knowledge Fusion
(DRAKE) framework, which incorporates dual retrieval of instructor and student
knowledge and knowledge fusion for tailored response generation. Experiments
conducted in real-world classroom settings demonstrate that the DRAKE framework
excels in aligning responses with knowledge retrieved from both instructor and
student sides. Furthermore, we offer additional extensions of YA-TA, such as a
Q&A board and self-practice tools to enhance the overall learning experience.
Our video is publicly available.
Authors' comments: 9 pages, 5 figures
Yujing Wang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng
In a real-world RAG system, the current query often involves spoken ellipses and ambiguous references from dialogue contexts, necessitating query rewriting to better describe the user's information needs. However, traditional context-based rewriting yields minimal improvement on downstream generation tasks due to the lengthy process from query rewriting to response generation. Some researchers try to utilize reinforcement learning with generation feedback to assist the rewriter, but these sparse rewards provide little guidance in most cases, leading to unstable training and generation results. We find that the user's needs are also reflected in the gold document, retrieved documents and ground truth. Therefore, by feeding back these multi-aspect dense rewards to query rewriting, more stable and satisfactory responses can be achieved. In this paper, we propose a novel query rewriting method MaFeRw, which improves RAG performance by integrating multi-aspect feedback from both the retrieval process and generated results. Specifically, we first use manual data to train a T5 model for the rewriter initialization. Next, we design three metrics as reinforcement learning feedback: the similarity between the rewritten query and the gold document, the ranking metrics, and ROUGE between the generation and the ground truth. Inspired by RLAIF, we train three kinds of reward models for the above metrics to achieve more efficient training. Finally, we combine the scores of these reward models as feedback, and use the PPO algorithm to explore the optimal query rewriting strategy. Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines.
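The multi-aspect feedback described above combines three signals into one scalar reward. A rough sketch of that combination is given below. It is not the authors' method: the abstract states that learned reward models produce the scores, whereas this example computes a simple unigram-overlap F1 as a stand-in for ROUGE and averages the three signals with hypothetical weights.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two strings (a simple ROUGE-1-style stand-in)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def combined_reward(similarity, ranking, generation, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three feedback signals; the weights are hypothetical."""
    signals = (similarity, ranking, generation)
    return sum(w * s for w, s in zip(weights, signals)) / sum(weights)


# Toy usage: the generation score comes from overlap with the ground-truth answer.
generation_score = unigram_f1("the cat sat on the mat", "the cat sat")
reward = combined_reward(similarity=0.5, ranking=1.0, generation=generation_score)
```

In a PPO setup, a scalar like `reward` would be assigned to each rewritten query sampled from the rewriter policy.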
Rishi Kalra, Zekun Wu, Ayesha Gulley, Airlie Hilliard, Xin Guan, Adriano Koshiyama, Philip Treleaven
Large Language Models (LLMs) face limitations in AI legal and policy
applications due to outdated knowledge, hallucinations, and poor reasoning in
complex contexts. Retrieval-Augmented Generation (RAG) systems address these
issues by incorporating external knowledge, but suffer from retrieval errors,
ineffective context integration, and high operational costs. This paper
presents the Hybrid Parameter-Adaptive RAG (HyPA-RAG) system, designed for the
AI legal domain, with NYC Local Law 144 (LL144) as the test case. HyPA-RAG
integrates a query complexity classifier for adaptive parameter tuning, a
hybrid retrieval approach combining dense, sparse, and knowledge graph methods,
and a comprehensive evaluation framework with tailored question types and
metrics. Testing on LL144 demonstrates that HyPA-RAG enhances retrieval
accuracy, response fidelity, and contextual precision, offering a robust and
adaptable solution for high-stakes legal and policy applications.
Authors' comments: NAACL 2025 Industry Track & EMNLP 2024 CustomNLP4U Workshop
Charles Constant, Santosh Bhattarai, Indigo Brownhall, Anasuya Aruliah, Marek Ziebart
We present a methodology to generate low-latency, high spatio-temporal
resolution thermospheric density estimates using publicly available Low Earth
Orbit (LEO) spacecraft ephemerides. This provides a means of generating density
estimates that can be used in a data-assimilative context by the satellite
operations and thermosphere communities. It also contributes to the database
of high-resolution density estimates during geomagnetic storms -- which remains
one of the major gaps for the development and benchmarking of density models.
Using accelerometer-derived densities from the Gravity Recovery And Climate
Experiment Follow-On (GRACE-FO) spacecraft as truth, our method surpasses
Energy Dissipation Rate-Type density retrieval techniques and three widely used
operational density models in terms of accuracy: EDR (103.37%), JB2008
(85.43%), DTM2000 (52.73%), and NRLMSISE-00 (12.31%). We demonstrate the
robustness of our methodology during a critical time for spacecraft operators,
namely when operating in the presence of geomagnetic storms, by
reconstructing density profiles along the orbits of three LEO satellites during
80 geomagnetic storms. These profiles exhibit high spatial and temporal
resolution compared to three operational thermospheric models, highlighting the
operational applicability and potential for their use in model validation. Our
findings suggest that the increasing availability of precise orbit
determination data offers a valuable, yet underutilized, resource that could
provide a significant improvement to data assimilative thermospheric models,
ultimately enhancing both spacecraft operations and thermospheric modeling
efforts.
Authors' comments: 29 pages, 6 figures
N. E. Kriman
The use of large language models (LLMs) has significantly increased since the
introduction of ChatGPT in 2022, demonstrating their value across various
applications. However, a major challenge for enterprise and commercial adoption
of LLMs is their tendency to generate inaccurate information, a phenomenon
known as "hallucination." This project proposes a method for estimating the
factuality of a summary generated by LLMs when compared to a source text. Our
approach utilizes Naive Bayes classification to assess the accuracy of the
content produced.
Authors' comments: 12 pages
Melisa Russak, Umar Jamil, Christopher Bryant, Kiran Kamble, Axel Magnuson, Mateusz Russak, Waseem AlShikh
In this paper, we introduce Writing in the Margins (WiM), a new inference pattern for Large Language Models designed to optimize the handling of long input sequences in retrieval-oriented tasks. This approach leverages the chunked prefill of the key-value cache to perform segment-wise inference, which enables efficient processing of extensive contexts along with the generation and classification of intermediate information ("margins") that guide the model towards specific tasks. This method increases computational overhead marginally while significantly enhancing the performance of off-the-shelf models without the need for fine-tuning. Specifically, we observe that WiM provides an average enhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG) and more than a 30.0% increase in the F1-score for aggregation tasks (CWE). Additionally, we show how the proposed pattern fits into an interactive retrieval design that provides end-users with ongoing updates about the progress of context processing, and pinpoints the integration of relevant information into the final response. We release our implementation of WiM using the Hugging Face Transformers library at https://github.com/writer/writing-in-the-margins.
Xin Liu, Shijie Tu, Yiwen Hu, Yifan Peng, Yubing Han, Cuifang Kuang, Xu Liu, Xiang Hao
Tightly focused optical fields are essential in nano-optics, but their
applications have been limited by the challenges of accurate yet efficient
characterization. In this article, we develop an in situ method for
reconstructing the fully vectorial information of tightly focused fields in
three-dimensional (3D) space, while simultaneously retrieving the pupil
functions. Our approach encodes these fields using phase-modulated focusing and
polarization-split detection, followed by decoding through an algorithm based
on least-sampling matrix-based Fourier transform and analytically derived
gradient. We further employ a focus scanning strategy. When combined with our
decoding algorithm, this strategy mitigates the imperfections in the detection
path. This approach requires only 10 frames of 2D measurements to realize
approximately 90% accuracy in tomography and pupil function retrieval within 10 s.
Thus, it serves as a robust and convenient tool for the precise
characterization and optimization of light at the nanoscale. We apply this
technique to fully vectorial field manipulation, adaptive-optics-assisted
nanoscopy, and addressing mixed-state problems.
Authors' comments: 10 pages, 5 figures
Elona Shatri, George Fazekas
Optical Music Recognition (OMR) automates the transcription of musical
notation from images into machine-readable formats like MusicXML, MEI, or MIDI,
significantly reducing the costs and time of manual transcription. This study
explores knowledge discovery in OMR by applying instance segmentation using
Mask R-CNN to enhance the detection and delineation of musical symbols in sheet
music. Unlike Optical Character Recognition (OCR), OMR must handle the
intricate semantics of Common Western Music Notation (CWMN), where symbol
meanings depend on shape, position, and context. Our approach leverages
instance segmentation to manage the density and overlap of musical symbols,
facilitating more precise information retrieval from music scores. Evaluations
on the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with
our method achieving a mean Average Precision (mAP) of up to 59.70% in dense
symbol environments, achieving comparable results to object detection.
Furthermore, using traditional computer vision techniques, we add a parallel
step for staff detection to infer the pitch for the recognised symbols. This
study emphasises the role of pixel-wise segmentation in advancing accurate
music symbol recognition, contributing to knowledge discovery in OMR. Our
findings indicate that instance segmentation provides more precise
representations of musical symbols, particularly in densely populated scores,
advancing OMR technology. We make our implementation, pre-processing scripts,
trained models, and evaluation results publicly available to support further
research and development.
Authors' comments: 8 pages of content and one page of references, accepted version at the
International Conference on Knowledge Discovery and Information Retrieval
2024, Porto, Portugal
Hao Jiang, Haoxiang Zhang, Qingshan Hou, Chaofeng Chen, Weisi Lin, Jingchang Zhang, Annan Wang
Providing high-quality item recall for text queries is crucial in large-scale e-commerce search systems. Current Embedding-based Retrieval Systems (ERS) embed queries and items into a shared low-dimensional space, but uni-modality ERS rely too heavily on textual features, making them unreliable in complex contexts. While multi-modality ERS incorporate various data sources, they often overlook individual preferences for different modalities, leading to suboptimal results. To address these issues, we propose MRSE, a Multi-modality Retrieval System that integrates text, item images, and user preferences through lightweight mixture-of-expert (LMoE) modules to better align features across and within modalities. MRSE also builds user profiles at a multi-modality level and introduces a novel hybrid loss function that enhances consistency and robustness using hard negative sampling. Experiments on a large-scale dataset from Shopee and online A/B testing show that MRSE achieves an 18.9% improvement in offline relevance and a 3.7% gain in online core metrics compared to Shopee's state-of-the-art uni-modality system.
Tianqi Wei, Zhi Chen, Xin Yu
Plant disease recognition is a critical task that ensures crop health and mitigates the damage caused by diseases. A handy tool that enables farmers to receive a diagnosis based on query pictures or the text description of suspicious plants is in high demand for initiating treatment before potential diseases spread further. In this paper, we develop a multimodal plant disease image retrieval system to support disease search based on either image or text prompts. Specifically, we utilize the largest in-the-wild plant disease dataset PlantWild, which includes over 18,000 images across 89 categories, to provide a comprehensive view of potential diseases relating to the query. Furthermore, cross-modal retrieval is achieved in the developed system, facilitated by a novel CLIP-based vision-language model that encodes both disease descriptions and disease images into the same latent space. Built on top of the retriever, our retrieval system allows users to upload either plant disease images or disease descriptions to retrieve the corresponding images with similar characteristics from the disease dataset to suggest candidate diseases for end users' consideration.
Yingqiang Gao, Jhony Prada, Nianlong Gu, Jessica Lam, Richard H. R. Hahnloser
Large Language Models (LLMs) produce eloquent texts but often the content they generate needs to be verified. Traditional information retrieval systems can assist with this task, but most systems have not been designed with LLM-generated queries in mind. As such, there is a compelling need for integrated systems that provide both retrieval and generation functionality within a single user interface. We present MODOC, a modular user interface that leverages the capabilities of LLMs and provides assistance with detecting their confabulations, promoting integrity in scientific writing. MODOC represents a significant step forward in scientific writing assistance. Its modular architecture supports flexible functions for retrieving information and for writing and generating text in a single, user-friendly interface.