Laura Caspari, Kanishka Ghosh Dastidar, Saber Zerhoudi, Jelena Mitrovic, Michael Granitzer
The choice of embedding model is a crucial step in the design of Retrieval Augmented Generation (RAG) systems. Given the sheer volume of available options, identifying clusters of similar models streamlines this model selection process. Benchmark performance scores alone allow only a weak assessment of model similarity. Thus, in this study, we evaluate the similarity of embedding models within the context of RAG systems. Our assessment is twofold: we use Centered Kernel Alignment to compare embeddings on a pair-wise level. Additionally, as it is especially pertinent to RAG systems, we evaluate the similarity of retrieval results between these models using Jaccard and rank similarity. We compare different families of embedding models, including proprietary ones, across five datasets from the popular Benchmarking Information Retrieval (BEIR) benchmark. Through our experiments, we identify clusters of models corresponding to model families, but interestingly, also some inter-family clusters. Furthermore, our analysis of top-k retrieval similarity reveals high variance at low k values. We also identify possible open-source alternatives to proprietary models, with Mistral exhibiting the highest similarity to OpenAI models.
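The two similarity views named in the abstract above can be illustrated with a minimal sketch. It assumes the linear-kernel variant of Centered Kernel Alignment and plain set-based Jaccard similarity over the top-k document IDs; the abstract does not specify the kernel or the exact rank similarity measure, and the function names `jaccard_at_k` and `linear_cka` are hypothetical.

```python
import math


def jaccard_at_k(ranking_a, ranking_b, k):
    """Jaccard similarity between the top-k document IDs of two rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


def _center_columns(matrix):
    """Subtract the column mean from every entry (rows are texts)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def _cross_sq_norm(x, y):
    """Squared Frobenius norm of x^T y for row-major matrices x and y."""
    total = 0.0
    for col_x in zip(*x):
        for col_y in zip(*y):
            dot = sum(a * b for a, b in zip(col_x, col_y))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two embedding matrices of the same texts.

    x and y may have different embedding dimensions but must have the
    same number of rows. The linear kernel is an assumption of this sketch.
    """
    x, y = _center_columns(x), _center_columns(y)
    denom = math.sqrt(_cross_sq_norm(x, x) * _cross_sq_norm(y, y))
    return _cross_sq_norm(x, y) / denom if denom else 0.0


if __name__ == "__main__":
    model_a = ["d1", "d2", "d3", "d7"]
    model_b = ["d2", "d3", "d4", "d1"]
    for k in (1, 3, 4):
        print(k, jaccard_at_k(model_a, model_b, k))
```

Evaluating `jaccard_at_k` at several values of k on the same pair of rankings is one simple way to see how sensitive the overlap is to the cutoff.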
Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Ying Shan et al.
Dominant dual-encoder models enable efficient image-text retrieval but suffer
from limited accuracy while the cross-encoder models offer higher accuracy at
the expense of efficiency. Distilling cross-modality matching knowledge from
cross-encoder to dual-encoder provides a natural approach to harness their
strengths. Thus, we investigate the following question: how can the
cross-encoder be made a good teacher for the dual-encoder? Our findings are
threefold: (1) The cross-modal similarity score distribution of the
cross-encoder is more concentrated, while that of the dual-encoder is nearly
normal, making vanilla logit distillation less effective. However, ranking
distillation remains practical, as it is not affected by the score
distribution. (2) Only the relative order between hard negatives conveys valid
knowledge, while the order information between easy negatives has little
significance. (3) Maintaining the coordination between the distillation loss
and the dual-encoder training loss is beneficial for knowledge transfer. Based
on these findings, we propose a novel Contrastive Partial Ranking Distillation
(CPRD) method, which implements the objective of mimicking the relative order
between hard negative samples with contrastive learning. This approach
coordinates with the training of the dual-encoder, effectively transferring
valid knowledge from the cross-encoder to the dual-encoder. Extensive
experiments on image-text retrieval and ranking tasks show that our method
surpasses other distillation methods and significantly improves the accuracy
of the dual-encoder.
Authors' comments: Accepted by CVPR 2024
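The idea of distilling only the relative order among hard negatives can be sketched as follows. This is not the CPRD loss itself: the abstract describes a contrastive objective, whereas this illustration swaps in a simpler pairwise logistic ranking loss. The function name `partial_ranking_loss` and the `num_hard` cutoff are hypothetical.

```python
import math


def partial_ranking_loss(teacher_scores, student_scores, num_hard=3):
    """Pairwise logistic loss over the hardest negatives only.

    teacher_scores / student_scores: similarity scores that the
    cross-encoder (teacher) and dual-encoder (student) assign to the same
    list of negative samples. Only the `num_hard` negatives with the
    highest teacher scores take part; the order among the remaining
    (easy) negatives is ignored.
    """
    # Indices of the negatives the teacher considers hardest, best first.
    hard = sorted(
        range(len(teacher_scores)),
        key=lambda i: teacher_scores[i],
        reverse=True,
    )[:num_hard]

    loss, pairs = 0.0, 0
    for pos, i in enumerate(hard):
        for j in hard[pos + 1:]:
            # The teacher ranks i above j; penalise the student if it disagrees.
            margin = student_scores[i] - student_scores[j]
            loss += math.log1p(math.exp(-margin))
            pairs += 1
    return loss / pairs if pairs else 0.0


if __name__ == "__main__":
    teacher = [0.9, 0.8, 0.7, 0.1, 0.0]
    print(partial_ranking_loss(teacher, [3.0, 2.0, 1.0, 0.0, 0.0]))
    print(partial_ranking_loss(teacher, [1.0, 2.0, 3.0, 0.0, 0.0]))
```

Because the loss depends only on score differences, it is insensitive to the shape of the teacher's score distribution, which is the property the abstract attributes to ranking distillation.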
Yashwardhan Chaudhuri, Paridhi Mundra, Arnesh Batra, Orchid Chetia Phukan, Arun Balaji Buduru
Recognition and interpretation of bird vocalizations are pivotal in
ornithological research and ecological conservation efforts due to their
significance in understanding avian behaviour, performing habitat assessment
and judging ecological health. This paper presents an audio spectrogram-guided
classification framework called ASGIR for improved bird sound recognition and
information retrieval. Our work is accompanied by a simple-to-use, two-step
information retrieval system that uses geographical location and bird sounds to
localize and retrieve relevant bird information by scraping Wikipedia page
information of recognized birds. ASGIR achieves strong performance on a
random subset of 51 classes of the Xeno-Canto dataset of bird sounds from
European countries, with a median of 100\% on the F1, precision, and
sensitivity metrics. Our code is available at:
https://github.com/MainSample1234/AS-GIR .
Authors' comments: Accepted to INTERSPEECH'24
Ekaterina Khramtsova, Teerapong Leelanupab, Shengyao Zhuang, Mahsa Baktashmotlagh, Guido Zuccon
In this demo we present a web-based application for selecting an effective
pre-trained dense retriever to use on a private collection. Our system,
DenseQuest, provides unsupervised selection and ranking capabilities to predict
the best dense retriever among a pool of available dense retrievers, tailored
to an uploaded target collection. DenseQuest implements a number of existing
approaches, including a recent, highly effective method powered by Large
Language Models (LLMs), which requires neither queries nor relevance judgments.
The system is designed to be intuitive and easy to use for those information
retrieval engineers and researchers who need to identify a general-purpose
dense retrieval model to encode or search a new private target collection. Our
demonstration illustrates the conceptual architecture and the different use
case scenarios of the system, which is implemented on the cloud, enabling
universal access and use. DenseQuest is available at https://densequest.ielab.io.
Authors' comments: SIGIR2024 demo paper
Renjie Liang, Li Li, Chongzhi Zhang, Jing Wang, Xizhou Zhu, Aixin Sun
In this paper, we propose the task of \textit{Ranked Video Moment Retrieval} (RVMR) to locate a ranked list of matching moments from a collection of videos, through queries in natural language. Although a few related tasks have been proposed and studied by the CV, NLP, and IR communities, RVMR is the task that best reflects the practical setting of moment search. To facilitate research in RVMR, we develop the TVR-Ranking dataset, based on the raw videos and existing moment annotations provided in the TVR dataset. Our key contribution is the manual annotation of relevance levels for 94,442 query-moment pairs. We then develop the $NDCG@K, IoU\geq \mu$ evaluation metric for this new task and conduct experiments to evaluate three baseline models. Our experiments show that the new RVMR task brings new challenges to existing models, and we believe this new dataset contributes to the research on multi-modality search. The dataset is available at \url{https://github.com/Ranking-VMR/TVR-Ranking}.
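One plausible reading of the $NDCG@K, IoU\geq \mu$ metric is sketched below. It makes several assumptions that the abstract does not confirm: a predicted moment earns the relevance level of an annotated moment only when both lie in the same video and their temporal IoU is at least $\mu$, each annotated moment can be matched once, and gains are linear in the relevance level. All function names are hypothetical.

```python
import math


def temporal_iou(pred, gold):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k_iou(predictions, annotations, k, mu):
    """NDCG@K in which a prediction only counts if its IoU reaches mu.

    predictions: ranked list of (video_id, start, end).
    annotations: list of (video_id, start, end, relevance) for one query.
    """
    used = set()  # annotated moments that were already matched
    gains = []
    for vid, start, end in predictions[:k]:
        best_idx, best_rel = None, 0
        for idx, (g_vid, g_start, g_end, rel) in enumerate(annotations):
            if idx in used or g_vid != vid:
                continue
            overlap = temporal_iou((start, end), (g_start, g_end))
            if overlap >= mu and rel > best_rel:
                best_idx, best_rel = idx, rel
        if best_idx is not None:
            used.add(best_idx)
        gains.append(best_rel)

    ideal = sorted((a[3] for a in annotations), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    gold = [("v1", 0.0, 10.0, 3), ("v2", 5.0, 15.0, 1)]
    ranked = [("v1", 0.0, 4.0), ("v2", 5.0, 15.0)]
    print(ndcg_at_k_iou(ranked, gold, k=2, mu=0.5))
```

Raising `mu` makes the metric stricter about localisation, while `k` controls how deep into the ranked list the evaluation looks.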
Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
Recent advances in neural information retrieval (IR) models have
significantly enhanced their effectiveness over various IR tasks. The
robustness of these models, essential for ensuring their reliability in
practice, has also garnered significant attention. With a wide array of
research on robust IR being proposed, we believe it is the opportune moment to
consolidate the current status, glean insights from existing methodologies, and
lay the groundwork for future development. We view the robustness of IR as a
multifaceted concept, emphasizing its necessity against adversarial attacks,
out-of-distribution (OOD) scenarios and performance variance. With a focus on
adversarial and OOD robustness, we dissect robustness solutions for dense
retrieval models (DRMs) and neural ranking models (NRMs), respectively,
recognizing them as pivotal components of the neural IR pipeline. We provide an
in-depth discussion of existing methods, datasets, and evaluation metrics,
shedding light on challenges and future directions in the era of large language
models. To the best of our knowledge, this is the first comprehensive survey on
the robustness of neural IR models, and we will also be giving our first
tutorial presentation at SIGIR 2024
\url{https://sigir2024-robust-information-retrieval.github.io}. Along with the
organization of existing work, we introduce a Benchmark for robust IR (BestIR),
a heterogeneous evaluation benchmark for robust neural information retrieval,
which is publicly available at \url{https://github.com/Davion-Liu/BestIR}. We
hope that this study provides useful clues for future research on the
robustness of IR models and helps to develop trustworthy search engines
\url{https://github.com/Davion-Liu/Awesome-Robustness-in-Information-Retrieval}.
Authors' comments: Survey paper
Anum Afzal, Alexander Kowsik, Rajna Fani, Florian Matthes
Large Language Models have found application in various mundane and repetitive tasks including Human Resource (HR) support. We worked with the domain experts of SAP SE to develop an HR support chatbot as an efficient and effective tool for addressing employee inquiries. We inserted a human-in-the-loop in various parts of the development cycles such as dataset collection, prompt optimization, and evaluation of generated output. By enhancing the LLM-driven chatbot's response quality and exploring alternative retrieval methods, we have created an efficient, scalable, and flexible tool for HR professionals to address employee inquiries effectively. Our experiments and evaluation conclude that GPT-4 outperforms other models and can overcome inconsistencies in data through internal reasoning capabilities. Additionally, through expert analysis, we infer that reference-free evaluation metrics such as G-Eval and Prometheus demonstrate reliability closely aligned with that of human evaluation.
Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, Yue Wang
This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation, dubbed RAM, featuring generalizability across various objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM capitalizes on a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundant out-of-domain data. First, RAM extracts unified affordance at scale from diverse sources of demonstrations including robotic data, human-object interaction (HOI) data, and custom data to construct a comprehensive affordance memory. Then given a language instruction, RAM hierarchically retrieves the most similar demonstration from the affordance memory and transfers such out-of-domain 2D affordance to in-domain 3D executable affordance in a zero-shot and embodiment-agnostic manner. Extensive simulation and real-world evaluations demonstrate that our RAM consistently outperforms existing works in diverse daily tasks. Additionally, RAM shows significant potential for downstream applications such as automatic and efficient data collection, one-shot visual imitation, and LLM/VLM-integrated long-horizon manipulation. For more details, please check our website at https://yxkryptonite.github.io/RAM/.
Hang Gao, Yongfeng Zhang
Vector retrieval algorithms are vital for semantic queries in the evolving landscape of Large Language Models (LLMs). Retrieving vectors that simultaneously meet criteria for both similarity and diversity significantly enhances the capabilities of LLM-based agents. Despite the widespread use of Maximal Marginal Relevance (MMR) in retrieval scenarios with relevance and diversity requirements, fluctuations caused by variations in the MMR parameter $\lambda$ complicate the determination of the optimization trajectory in vector spaces, thus obscuring the direction of enhancement. Moreover, there is a lack of robust theoretical analysis of the constraints of similarity and diversity in retrieval processes. This paper introduces a novel approach to characterizing both constraints through the relationship between the sum vector and the query vector. The proximity of these vectors addresses the similarity constraint, while requiring that the individual vectors within the sum vector align divergently with the query vector satisfies the diversity constraint. We also formulate a new combinatorial optimization problem: selecting $k$ vectors from a set of candidates such that their sum vector maximally aligns with the query vector, a problem we demonstrate to be NP-complete. This establishes the profound difficulty of pursuing similarity and diversity simultaneously in vector retrieval and lays a theoretical groundwork for further research. Additionally, we present the heuristic algorithm Vectors Retrieval with Similarity and Diversity (VRSD), which not only has a definitive optimization goal and eschews the need for preset parameters but also offers a modest reduction in time complexity compared to MMR. Empirical validation further confirms that VRSD significantly surpasses MMR across various datasets.
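The contrast between MMR and the sum-vector objective can be made concrete with a small sketch. The `mmr` function follows the standard MMR formulation with cosine similarity. The `greedy_sum_alignment` function is a hypothetical greedy heuristic for the sum-vector objective described above, offered as an illustration rather than the authors' exact VRSD procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k, lam=0.5):
    """Maximal Marginal Relevance: trade relevance against redundancy via lam."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * cosine(query, candidates[i]) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def greedy_sum_alignment(query, candidates, k):
    """Greedily add the vector whose inclusion best aligns the sum with the query.

    No trade-off parameter is needed: the single objective is the cosine
    similarity between the query and the sum of the selected vectors.
    """
    selected, remaining = [], list(range(len(candidates)))
    total = [0.0] * len(query)
    while remaining and len(selected) < k:
        def score(i):
            return cosine(query, [t + c for t, c in zip(total, candidates[i])])
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        total = [t + c for t, c in zip(total, candidates[best])]
    return selected


if __name__ == "__main__":
    q = [1.0, 0.0]
    docs = [[0.9, 0.1], [0.9, 0.12], [0.8, -0.2]]
    print(mmr(q, docs, 2, lam=1.0), mmr(q, docs, 2, lam=0.5))
    print(greedy_sum_alignment(q, docs, 2))
```

In this toy example the MMR result changes with `lam`, whereas the sum-vector heuristic prefers a second vector that offsets the first one's deviation from the query, which is how diversity arises without a tunable parameter.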
Aleksander Ficek, Jiaqi Zeng, Oleksii Kuchaiev
Parameter-Efficient Fine-Tuning (PEFT) and Retrieval-Augmented Generation
(RAG) have become popular methods for adapting large language models while
minimizing compute requirements. In this paper, we apply PEFT methods
(P-tuning, Adapters, and LoRA) to a modified Retrieval-Enhanced Transformer
(RETRO) and a baseline GPT model across several sizes, ranging from 823 million
to 48 billion parameters. We show that RETRO models outperform GPT models in
zero-shot settings due to their unique pre-training process but GPT models have
higher performance potential with PEFT. Additionally, our study indicates that
8B parameter models strike an optimal balance between cost and performance and
P-tuning lags behind other PEFT techniques. We further provide a comparative
analysis between applying PEFT to an Instruction-tuned RETRO model and base
RETRO model. This work presents the first comprehensive comparison of various
PEFT methods integrated with RAG, applied to both GPT and RETRO models,
highlighting their relative performance.
Authors' comments: EMNLP 2024
Chenglei Shen, Xiao Zhang, Teng Shi, Changshuo Zhang, Guofu Xie, Jun Xu
Controllable learning (CL) emerges as a critical component in trustworthy machine learning, ensuring that learners meet predefined targets and can adaptively adjust without retraining according to the changes in those targets. We provide a formal definition of CL, and discuss its applications in information retrieval (IR) where information needs are often complex and dynamic. The survey categorizes CL according to who controls (users or platforms), what is controllable (e.g., retrieval objectives, users' historical behaviors, controllable environmental adaptation), how control is implemented (e.g., rule-based methods, Pareto optimization, hypernetworks), and where to implement control (e.g., pre-processing, in-processing, post-processing methods). Then, we identify challenges faced by CL across training, evaluation, task setting, and deployment in online environments. Additionally, we outline promising directions for CL in theoretical analysis, efficient computation, empowering large language models, application scenarios, and evaluation frameworks in IR.
Mengzhao Wang, Haotian Wu, Xiangyu Ke, Yunjun Gao, Xiaoliang Xu, Lu Chen
Retrieval-augmented Large Language Models (LLMs) have reshaped traditional
query-answering systems, offering unparalleled user experiences. However,
existing retrieval techniques often struggle to handle multi-modal query
contexts. In this paper, we present an interactive Multi-modal Query Answering
(MQA) system, empowered by our newly developed multi-modal retrieval framework
and navigation graph index, integrated with cutting-edge LLMs. It comprises
five core components: Data Preprocessing, Vector Representation, Index
Construction, Query Execution, and Answer Generation, all orchestrated by a
dedicated coordinator to ensure smooth data flow from input to answer
generation. One notable aspect of MQA is its utilization of contrastive
learning to assess the significance of different modalities, facilitating
precise measurement of multi-modal information similarity. Furthermore, the
system achieves efficient retrieval through our advanced navigation graph
index, refined using computational pruning techniques. Another highlight of our
system is its pluggable processing framework, allowing seamless integration of
embedding models, graph indexes, and LLMs. This flexibility provides users
diverse options for gaining insights from their multi-modal knowledge base. A
preliminary video introduction of MQA is available at
https://youtu.be/xvUuo2ZIqWk.
Authors' comments: This demo paper has been accepted by VLDB 2024
Divya Kumawat, Ardeshir Ebtehaj, Xiaolan Xu, Andreas Colliander, Vipin Kumar
Estimating the landscape and soil freeze-thaw (FT) dynamics in the Northern Hemisphere is crucial for understanding permafrost response to global warming and changes in regional and global carbon budgets. A new framework is presented for surface FT-cycle retrievals using L-band microwave radiometry based on a deep convolutional autoencoder neural network. This framework defines the landscape FT-cycle retrieval as a time series anomaly detection problem considering the frozen states as normal and thawed states as anomalies. The autoencoder retrieves the FT-cycle probabilistically through supervised reconstruction of the brightness temperature (TB) time series using a contrastive loss function that minimizes (maximizes) the reconstruction error for the peak winter (summer). Using the data provided by the Soil Moisture Active Passive (SMAP) satellite, it is demonstrated that the framework learns to isolate the landscape FT states over different land surface types with varying complexities related to the radiometric characteristics of snow cover, lake-ice phenology, and vegetation canopy. The consistency of the retrievals is evaluated over Alaska, against in situ ground-based observations, showing reduced uncertainties compared to the traditional methods that use thresholding of the normalized polarization ratio.
Rui Yang
This paper presents CaseGPT, an innovative approach that combines Large
Language Models (LLMs) and Retrieval-Augmented Generation (RAG) technology to
enhance case-based reasoning in the healthcare and legal sectors. The system
addresses the challenges of traditional database queries by enabling fuzzy
searches based on imprecise descriptions, thereby improving data searchability
and usability. CaseGPT not only retrieves relevant case data but also generates
insightful suggestions and recommendations based on patterns discerned from
existing case data. This functionality proves especially valuable for tasks
such as medical diagnostics, legal precedent research, and case strategy
formulation. The paper includes an in-depth discussion of the system's
methodology, its performance in both medical and legal domains, and its
potential for future applications. Our experiments demonstrate that CaseGPT
significantly outperforms traditional keyword-based and simple LLM-based
systems in terms of precision, recall, and efficiency.
Authors' comments: Submitted to ICCBR
Mainak Singha, Ankit Jha, Divyam Gupta, Pranav Singla, Biplab Banerjee
We address the challenges inherent in sketch-based image retrieval (SBIR)
across various settings, including zero-shot SBIR, generalized zero-shot SBIR,
and fine-grained zero-shot SBIR, by leveraging the vision-language foundation
model CLIP. While recent endeavors have employed CLIP to enhance SBIR, these
approaches predominantly follow uni-modal prompt processing and fail to fully
exploit CLIP's integrated visual and textual capabilities. To bridge this
gap, we introduce SpLIP, a novel multi-modal prompt learning scheme designed to
operate effectively with frozen CLIP backbones. We diverge from existing
multi-modal prompting methods that treat visual and textual prompts
independently or integrate them in a limited fashion, leading to suboptimal
generalization. SpLIP implements a bi-directional prompt-sharing strategy that
enables mutual knowledge exchange between CLIP's visual and textual encoders,
fostering a more cohesive and synergistic prompt processing mechanism that
significantly reduces the semantic gap between the sketch and photo embeddings.
In addition to pioneering multi-modal prompt learning, we propose two
innovative strategies for further refining the embedding space. The first is an
adaptive margin generation for the sketch-photo triplet loss, regulated by
CLIP's class textual embeddings. The second introduces a novel task, termed
conditional cross-modal jigsaw, aimed at enhancing fine-grained sketch-photo
alignment by implicitly modeling sketches' viable patch arrangement using
knowledge of unshuffled photos. Our comprehensive experimental evaluations
across multiple benchmarks demonstrate the superior performance of SpLIP in all
three SBIR scenarios. Project page: https://mainaksingha01.github.io/SpLIP/ .
Authors' comments: Accepted in ECCV 2024
Taeho Hwang, Soyeong Jeong, Sukmin Cho, SeungYoon Han, Jong C. Park
Recent advancements in Large Language Models (LLMs) have significantly
improved their performance across various Natural Language Processing (NLP)
tasks. However, LLMs still tend to generate non-factual responses due
to limitations in their parametric memory. Retrieval-Augmented Generation (RAG)
systems address this issue by incorporating external knowledge with a retrieval
module. Despite their successes, however, current RAG systems face challenges
with retrieval failures and the limited ability of LLMs to filter out
irrelevant information. Therefore, in this work, we propose DSLR (Document
Refinement with Sentence-Level Re-ranking and Reconstruction), an unsupervised
framework that decomposes retrieved documents into sentences, filters out
irrelevant sentences, and reconstructs them again into coherent passages. We
experimentally validate DSLR on multiple open-domain QA datasets, and the
results demonstrate that DSLR significantly enhances RAG performance over
conventional fixed-size passages. Furthermore, DSLR enhances performance in
specific, yet realistic scenarios without the need for additional training,
providing an effective and efficient solution for refining retrieved documents
in RAG systems.
Authors' comments: 20 pages
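The decompose, filter, and reconstruct pipeline described in the DSLR abstract can be sketched roughly as follows. The relevance scorer here is a simple token-overlap stand-in, whereas the abstract refers to sentence-level re-ranking whose model is not specified. The regex-based sentence splitter, the threshold value, and all function names are assumptions of this illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by space."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(query, sentence):
    """Fraction of query tokens that also occur in the sentence.

    Stand-in for a real re-ranker; swap in any (query, sentence) scorer.
    """
    q_tokens = set(re.findall(r"\w+", query.lower()))
    s_tokens = set(re.findall(r"\w+", sentence.lower()))
    return len(q_tokens & s_tokens) / len(q_tokens) if q_tokens else 0.0


def refine_document(query, document, threshold=0.3):
    """Keep only sentences scoring at least `threshold`, in original order."""
    sentences = split_sentences(document)
    kept = [s for s in sentences if overlap_score(query, s) >= threshold]
    # Joining in the original order keeps the refined passage readable.
    return " ".join(kept)


if __name__ == "__main__":
    doc = (
        "Alexander Fleming discovered penicillin in 1928. "
        "He was born in Scotland. "
        "Penicillin is an antibiotic."
    )
    print(refine_document("Who discovered penicillin?", doc))
```

Restoring the surviving sentences in their original order, rather than in score order, is the design choice that keeps the reconstructed passage coherent.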
Yu Zhao, Ying Zhang, Baohang Zhou, Xinying Qian, Kehui Song, Xiangrui Cai
A large number of studies have emerged for Multimodal Knowledge Graph
Completion (MKGC) to predict the missing links in MKGs. However, fewer studies
address inductive MKGC (IMKGC), which involves emerging entities unseen during
training. Existing inductive approaches focus on learning textual entity
representations, neglecting the rich semantic information in the visual
modality. Moreover, they focus on aggregating structural neighbors from
existing KGs, which are usually limited for emerging entities. Semantic
neighbors, however, are decoupled from the topological linkage and usually
imply the true target entity. In this paper, we propose the IMKGC task and a
semantic neighbor retrieval-enhanced IMKGC framework, CMR, in which the
contrast step brings helpful semantic neighbors close, and the memorize step
then supports semantic neighbor retrieval to enhance inference. Specifically, we
first propose a unified cross-modal contrastive learning to simultaneously
capture the textual-visual and textual-textual correlations of query-entity
pairs in a unified representation space. The contrastive learning increases the
similarity of positive query-entity pairs, therefore making the representations
of helpful semantic neighbors close. Then, we explicitly memorize the knowledge
representations to support the semantic neighbor retrieval. At test time, we
retrieve the nearest semantic neighbors and interpolate them to the
query-entity similarity distribution to augment the final prediction. Extensive
experiments validate the effectiveness of CMR on three inductive MKGC datasets.
Codes are available at https://github.com/OreOZhao/CMR.
Authors' comments: Accepted by SIGIR 2024
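The test-time step of CMR, retrieving nearest semantic neighbors from a memory and interpolating them into the model's prediction, can be illustrated with a minimal sketch. The dot-product similarity, the softmax weighting of neighbors, the interpolation weight `lam`, and every function name are assumptions made for illustration. The abstract does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def knn_distribution(query_vec, memory, num_entities, k=2):
    """Turn the k nearest memorized entries into a distribution over entities.

    memory: list of (representation, target_entity_id) pairs stored
    after training.
    """
    scored = sorted(
        ((dot(query_vec, vec), ent) for vec, ent in memory),
        reverse=True,
    )[:k]
    weights = softmax([score for score, _ in scored])
    dist = [0.0] * num_entities
    for weight, (_, ent) in zip(weights, scored):
        dist[ent] += weight
    return dist


def interpolate(model_dist, knn_dist, lam=0.5):
    """Blend the model's distribution with the retrieved-neighbor distribution."""
    return [(1 - lam) * p + lam * q for p, q in zip(model_dist, knn_dist)]


if __name__ == "__main__":
    memory = [([1.0, 0.0], 2), ([0.9, 0.1], 2), ([0.0, 1.0], 0)]
    knn = knn_distribution([1.0, 0.0], memory, num_entities=3, k=2)
    print(interpolate([0.5, 0.3, 0.2], knn, lam=0.5))
```

In this toy case the model alone would rank entity 0 first, while the retrieved neighbors shift the final prediction to entity 2.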
Nastaran Bassamzadeh, Chhaya Methani
Natural Language to Code Generation has made significant progress in recent
years with the advent of Large Language Models (LLMs). While generation for
general-purpose languages like C, C++, and Python has improved significantly,
LLMs struggle with custom function names in Domain Specific Languages (DSLs).
This leads to higher hallucination rates and syntax errors, especially for DSLs
with a high number of custom function names. Additionally, constant updates
to function names add to the challenge, as LLMs need to stay up-to-date. In this
paper, we present optimizations for using Retrieval Augmented Generation (RAG)
with LLMs for DSL generation, along with an ablation study comparing these
strategies. We generated a training as well as a test dataset with a DSL to
represent automation tasks across roughly 700 APIs in the public domain. We used
the training dataset to fine-tune a Codex model for this DSL. Our results
showed that the fine-tuned model scored the best on the code similarity metric.
With our RAG optimizations, we achieved parity on the similarity metric. The
compilation rate, however, showed that both models still got the syntax
wrong many times, with the RAG-based method being 2 pts better. Conversely, the
hallucination rate for the RAG model lagged by 1 pt for API names and by 2 pts for
API parameter keys. We conclude that an optimized RAG model can match the
quality of fine-tuned models and offer advantages for new, unseen APIs.
Authors' comments: 8 pages, 1 figure
Kazuaki Furumai, Roberto Legaspi, Julio Vizcarra, Yudai Yamazaki, Yasutaka Nishimura, Sina J. Semnani, Kazushi Ikeda, Weiyan Shi et al.
Persuasion plays a pivotal role in a wide range of applications from health intervention to the promotion of social good. Persuasive chatbots can accelerate the positive effects of persuasion in such applications. Existing methods rely on fine-tuning persuasive chatbots with task-specific training data, which is costly, if not infeasible, to collect. To address this issue, we propose a method to leverage the generalizability and inherent persuasive abilities of large language models (LLMs) in creating effective and truthful persuasive chatbots for any given domain in a zero-shot manner. Unlike previous studies, which used pre-defined persuasion strategies, our method first uses an LLM to generate responses, then extracts the strategies used on the fly, and replaces any unsubstantiated claims in the response with retrieved facts supporting the strategies. We applied our chatbot, PersuaBot, to three significantly different domains needing persuasion skills: donation solicitation, recommendations, and health intervention. Our experiments on simulated and human conversations show that our zero-shot approach is more persuasive than prior work, while achieving factual accuracy surpassing state-of-the-art knowledge-oriented chatbots. Our study demonstrated that when persuasive chatbots are employed responsibly for social good, they are an enabler of positive individual and social change.
Zhili Shen, Pavlos Vougiouklis, Chenxin Diao, Kaustubh Vyas, Yuanyi Ji, Jeff Z. Pan
We focus on Text-to-SQL semantic parsing from the perspective of
retrieval-augmented generation. Motivated by challenges related to the size of
commercial database schemata and the deployability of business intelligence
solutions, we propose $\text{ASTReS}$ that dynamically retrieves input database
information and uses abstract syntax trees to select few-shot examples for
in-context learning.
Furthermore, we investigate the extent to which an in-parallel semantic
parser can be leveraged for generating approximated versions of the expected
SQL queries, to support our retrieval. We take this approach to the extreme: we
adapt a model consisting of fewer than $500$M parameters to act as an extremely
efficient approximator, enhancing it with the ability to process schemata in a
parallelised manner. We apply $\text{ASTReS}$ to monolingual and cross-lingual
benchmarks for semantic parsing, showing improvements over state-of-the-art
baselines. Comprehensive experiments highlight the contribution of modules
involved in this retrieval-augmented generation setting, revealing interesting
directions for future work.
Authors' comments: EMNLP 2024 Main