Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo et al.
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility by incorporating external contexts. However, the input length grows linearly with the number of retrieved documents, causing a dramatic increase in latency. In this paper, we propose a novel paradigm named Sparse RAG, which seeks to cut computation costs through sparsity. Specifically, Sparse RAG encodes retrieved documents in parallel, which eliminates the latency introduced by long-range attention over retrieved documents. Then, LLMs selectively decode the output auto-regressively by attending only to highly relevant caches, which are chosen by prompting LLMs with special control tokens. Notably, Sparse RAG combines the assessment of each individual document and the generation of the response into a single process. This sparse mechanism reduces the number of documents loaded during decoding, which accelerates inference of the RAG system. Additionally, filtering out undesirable contexts enhances the model's focus on relevant context, inherently improving its generation quality. Evaluation results on two datasets show that Sparse RAG can strike an optimal balance between generation quality and computational efficiency, demonstrating its generalizability across both short- and long-form generation tasks.
Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen
Recently, integrating external tools with Large Language Models (LLMs) has
gained significant attention as an effective strategy to mitigate the
limitations inherent in their pre-training data. However, real-world systems
often incorporate a wide array of tools, making it impractical to input all
tools into LLMs due to length limitations and latency constraints. Therefore,
to fully exploit the potential of tool-augmented LLMs, it is crucial to develop
an effective tool retrieval system. Existing tool retrieval methods primarily
focus on semantic matching between user queries and tool descriptions,
frequently leading to the retrieval of redundant, similar tools. Consequently,
these methods fail to provide a complete set of diverse tools necessary for
addressing the multifaceted problems encountered by LLMs. In this paper, we
propose a novel model-agnostic COllaborative Learning-based Tool Retrieval
approach, COLT, which captures not only the semantic similarities between user
queries and tool descriptions but also takes into account the collaborative
information of tools. Specifically, we first fine-tune the PLM-based retrieval
models to capture the semantic relationships between queries and tools in the
semantic learning stage. Subsequently, we construct three bipartite graphs
among queries, scenes, and tools and introduce a dual-view graph collaborative
learning framework to capture the intricate collaborative relationships among
tools during the collaborative learning stage. Extensive experiments on both
the open benchmark and the newly introduced ToolLens dataset show that COLT
achieves superior performance. Notably, BERT-mini (11M) with our proposed
model framework outperforms BERT-large (340M), which has 30 times more
parameters. Furthermore, we will release ToolLens publicly to facilitate
future research on tool retrieval.
Authors' comments: Accepted by CIKM 2024; GitHub: https://github.com/quchangle1/COLT
Yiming Wu, Hangfei Li, Fangfang Wang, Yilong Zhang, Ronghua Liang
In the domain of language-based fashion image retrieval, pinpointing the
desired fashion item using both a reference image and its accompanying textual
description is an intriguing challenge. Existing approaches lean heavily on
static fusion techniques, intertwining image and text. Despite their
commendable advancements, these approaches are still limited by a deficiency in
flexibility. In response, we propose a Self-distilled Dynamic Fusion Network to
compose the multi-granularity features dynamically by considering the
consistency of routing path and modality-specific information simultaneously.
Two new modules are included in our proposed method: (1) Dynamic Fusion Network
with Modality Specific Routers. The dynamic network enables a flexible
determination of the routing for each reference image and modification text,
taking into account their distinct semantics and distributions. (2) Self Path
Distillation Loss. A stable path decision for queries benefits the optimization
of feature extraction as well as routing, and we approach this by progressively
refining the path decision with previous path information. Extensive experiments
demonstrate the effectiveness of our proposed model compared to existing
methods.
Authors' comments: ICASSP 2024
Zhongnian Li, Jinghao Xu, Peng Ying, Meng Wei, Xinzheng Xu
Pre-trained Vision-Language Models (VLMs) exhibit strong zero-shot
classification abilities, demonstrating great potential for generating weakly
supervised labels. Unfortunately, existing weakly supervised learning methods
struggle to generate accurate labels via VLMs. In this paper, we
propose a novel weakly supervised labeling setting, namely True-False Labels
(TFLs), which can achieve high accuracy when generated by VLMs. The TFL
indicates whether an instance belongs to the label, which is randomly and
uniformly sampled from the candidate label set. Specifically, we theoretically
derive a risk-consistent estimator to explore and utilize the conditional
probability distribution information of TFLs. Besides, we propose a
convolutional-based Multi-modal Prompt Retrieving (MRP) method to bridge the
gap between the knowledge of VLMs and target learning tasks. Experimental
results demonstrate the effectiveness of the proposed TFL setting and MRP
learning method. The code to reproduce the experiments is at
https://github.com/Tranquilxu/TMP.
Authors' comments: 15 pages, 5 figures
Yuxuan Liu, Tianchi Yang, Zihan Zhang, Minghui Song, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang
Generative retrieval, a promising new paradigm in information retrieval, employs a seq2seq model to encode document features into parameters and decode relevant document identifiers (IDs) based on search queries. Existing generative retrieval solutions typically rely on a preprocessing stage to pre-define document IDs, which can suffer from a semantic gap between these IDs and the retrieval task. However, end-to-end training for both ID assignments and retrieval tasks is challenging due to the long-tailed distribution characteristics of real-world data, resulting in inefficient and unbalanced ID space utilization. To address these issues, we propose ASI++, a novel fully end-to-end generative retrieval method that aims to simultaneously learn balanced ID assignments and improve retrieval performance. ASI++ builds on the fully end-to-end training framework of vanilla ASI and introduces several key innovations. First, a distributionally balanced criterion addresses the imbalance in ID assignments, promoting more efficient utilization of the ID space. Next, a representation bottleneck criterion enhances dense representations to alleviate bottlenecks in learning ID assignments. Finally, an information consistency criterion integrates these processes into a joint optimization framework grounded in information theory. We further explore various module structures for learning ID assignments, including neural quantization, differentiable product quantization, and residual quantization. Extensive experiments on both public and industrial datasets demonstrate the effectiveness of ASI++ in improving retrieval performance and achieving balanced ID assignments.
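Of the quantization modules named above, residual quantization is perhaps the easiest to illustrate. The following is a minimal pure-Python sketch, assuming tiny hand-written codebooks (hypothetical values) in place of learned ones; it is not the ASI++ implementation. Each level picks the codeword nearest to the current residual, and the resulting index sequence plays the role of a document ID.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def residual_quantize(
    vector: Vector, codebooks: List[List[Vector]]
) -> Tuple[List[int], List[float]]:
    """Greedy residual quantization: one codeword index per codebook level."""
    residual = list(vector)
    codes: List[int] = []
    for codebook in codebooks:
        # Pick the codeword closest to what is still unexplained.
        index = min(
            range(len(codebook)),
            key=lambda i: squared_distance(residual, codebook[i]),
        )
        codes.append(index)
        # Subtract the chosen codeword; the next level quantizes the remainder.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes, residual


# Hypothetical two-level codebooks, for illustration only.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)],   # coarse level
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25)],  # fine level
]

doc_id, leftover = residual_quantize((1.2, 1.0), CODEBOOKS)
print(doc_id, leftover)
```

In this sketch the coarse level absorbs most of the vector and the fine level corrects the remainder, which is why deeper levels can use smaller codewords.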
Jiajie Jin, Yutao Zhu, Guanting Dong, Yuyao Zhang, Xinyu Yang, Chenghao Zhang, Tong Zhao, Zhao Yang et al.
With the advent of large language models (LLMs) and multimodal large language
models (MLLMs), the potential of retrieval-augmented generation (RAG) has
attracted considerable research attention. Various novel algorithms and models
have been introduced to enhance different aspects of RAG systems. However, the
absence of a standardized framework for implementation, coupled with the
inherently complex RAG process, makes it challenging and time-consuming for
researchers to compare and evaluate these approaches in a consistent
environment. Existing RAG toolkits, such as LangChain and LlamaIndex, while
available, are often heavy and inflexible, failing to meet the customization
needs of researchers. In response to this challenge, we develop FlashRAG, an
efficient and modular open-source toolkit designed to assist researchers in
reproducing and comparing existing RAG methods and developing their own
algorithms within a unified framework. Our toolkit has implemented 16 advanced
RAG methods and gathered and organized 38 benchmark datasets. It has various
features, including a customizable modular framework, multimodal RAG
capabilities, a rich collection of pre-implemented RAG works, comprehensive
datasets, efficient auxiliary pre-processing scripts, and extensive and
standard evaluation metrics. Our toolkit and resources are available at
https://github.com/RUC-NLPIR/FlashRAG.
Authors' comments: The paper is accepted by WWW2025 Resource Track
S. Winning, M. Lietzow-Sinjen, S. Wolf
Context. As a new growing field, exocartography aims to map the surface
features of exoplanets that are beyond the resolution of traditional observing
techniques. While photometric approaches have been discussed extensively,
polarimetry has received less attention despite its promising prospects.
Aims. We demonstrate that the limb polarization of an exoplanetary atmosphere
offers valuable insights into its cloud cover distribution. Specifically, we
determine an upper limit for the polarimetric precision that is required to
extract information about the latitudinal cloud cover of temperate Jovian
planets for scenarios of observations with and without host stars.
Methods. To compute the scattered stellar radiation of an exoplanetary
atmosphere and to study the polarization at various planetary phase angles, we
used the three-dimensional Monte Carlo radiative transfer code POLARIS.
Results. When the planetary signal can be measured separately from the
stellar radiation, information about the latitudinal cloud cover for polar cap
models is accessible at polarimetric sensitivities of $0.1$ %. In contrast, a
precision of about $10^{-3}$ ppm is required when the stellar flux is included
to gain this information.
Authors' comments: Accepted for publication in Astronomy & Astrophysics. 8 pages, 5
figures
Weijia Liu, Bo Miao, Jiuxin Cao, Xuelin Zhu, Bo Liu, Mehwish Nasim, Ajmal Mian
Current methods for Video Moment Retrieval (VMR) struggle to align complex situations involving specific environmental details, character descriptions, and action narratives. To tackle this issue, we propose a Large Language Model-guided Moment Retrieval (LMR) approach that employs the extensive knowledge of Large Language Models (LLMs) to improve video context representation as well as cross-modal alignment, facilitating accurate localization of target moments. Specifically, LMR introduces a context enhancement technique with LLMs to generate crucial target-related context semantics. These semantics are integrated with visual features for producing discriminative video representations. Finally, a language-conditioned transformer is designed to decode free-form language queries, on the fly, using aligned video representations for moment retrieval. Extensive experiments demonstrate that LMR achieves state-of-the-art results, outperforming the nearest competitor by up to 3.28% and 4.06% on the challenging QVHighlights and Charades-STA benchmarks, respectively. More importantly, the performance gains are significantly higher for localization of complex queries.
Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Yihang Duan, Xinyu Lyu, Hengtao Shen
Adapting large-scale image-text pre-training models, e.g., CLIP, to the video
domain represents the current state-of-the-art for text-video retrieval. The
primary approaches involve transferring text-video pairs to a common embedding
space and leveraging cross-modal interactions on specific entities for semantic
alignment. Though effective, these paradigms entail prohibitive computational
costs, leading to inefficient retrieval. To address this, we propose a simple
yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which
capitalizes on latent shared semantics across modalities for text-video
retrieval. Specifically, we introduce a parameter-free global interaction
module to explore coarse-grained alignment. Then, we devise a shared local
interaction module that employs several learnable queries to capture latent
semantic concepts for learning fine-grained alignment. Furthermore, an
Inter-Consistency Loss (ICL) is devised to accomplish the concept alignment
between the visual query and corresponding textual query, and an
Intra-Diversity Loss (IDL) is developed to repulse the distribution within
visual (textual) queries to generate more discriminative concepts. Extensive
experiments on five widely used benchmarks (i.e., MSR-VTT, MSVD, DiDeMo, LSMDC,
and ActivityNet) substantiate the superior effectiveness and efficiency of the
proposed method. Remarkably, our method achieves comparable performance with
SOTA as well as being nearly 220 times faster in terms of computational cost.
Code is available at: https://github.com/zchoi/GLSCL.
Authors' comments: The author has withdrawn this paper due to a critical definitional
error in concept learning for global/local-interaction learning during
training. This error led to an alignment issue with the definition of the
text-video retrieval task, causing an unfair comparison with state-of-the-art
(SOTA) methods. Consequently, this hindered the accurate evaluation of the
paper's contributions
Yuang Zhao, Zhaocheng Du, Qinglin Jia, Linxuan Zhang, Zhenhua Dong, Ruiming Tang
With the increase in the business scale and number of domains in online advertising, multi-domain ad recommendation has become a mainstream solution in the industry. The core of multi-domain recommendation is effectively modeling the commonalities and distinctions among domains. Existing works are dedicated to designing model architectures for implicit multi-domain modeling while overlooking an in-depth investigation from a more fundamental perspective of feature distributions. This paper focuses on features with significant differences across various domains in both distributions and effects on model predictions. We refer to these features as domain-sensitive features, which serve as carriers of domain distinctions and are crucial for multi-domain modeling. Experiments demonstrate that existing multi-domain modeling methods may neglect domain-sensitive features, indicating insufficient learning of domain distinctions. To avoid this neglect, we propose a domain-sensitive feature attribution method to identify features that best reflect domain distinctions from the feature set. Further, we design a memory architecture that extracts domain-specific information from domain-sensitive features for the model to retrieve and integrate, thereby enhancing the awareness of domain distinctions. Extensive offline and online experiments demonstrate the superiority of our method in capturing domain distinctions and improving multi-domain recommendation performance.
Vatsal Raina, Mark Gales
Enterprise retrieval augmented generation (RAG) offers a highly flexible
framework for combining powerful large language models (LLMs) with internal,
possibly temporally changing, documents. In RAG, documents are first chunked.
Relevant chunks are then retrieved for a user query and passed as
context to a synthesizer LLM to generate the query response. However, the
retrieval step can limit performance, as incorrect chunks can lead the
synthesizer LLM to generate a false response. This work applies a zero-shot
adaptation of standard dense retrieval steps for more accurate chunk recall.
Specifically, a chunk is first decomposed into atomic statements. A set of
synthetic questions is then generated on these atoms (with the chunk as the
context). Dense retrieval involves finding the closest set of synthetic
questions, and associated chunks, to the user query. It is found that retrieval
with the atoms leads to higher recall than retrieval with chunks. Further
performance gain is observed with retrieval using the synthetic questions
generated over the atoms. Higher recall at the retrieval step enables higher
performance of the enterprise LLM using the RAG pipeline.
Authors' comments: 14 pages, 5 figures, 5 tables
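The retrieval step in the entry above (match the user query against synthetic questions, then return the chunks those questions came from) can be sketched as follows. This is an illustrative toy, not the authors' pipeline: the question index is hand-written (hypothetical data standing in for LLM-generated questions over atoms), and a bag-of-words cosine similarity stands in for a dense encoder.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a dense encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical index of (synthetic question, source chunk id) pairs.
QUESTION_INDEX: List[Tuple[str, str]] = [
    ("How many vacation days do employees get per year?", "hr-policy"),
    ("When must vacation requests be submitted?", "hr-policy"),
    ("Which port does the staging server listen on?", "deploy-guide"),
    ("Who approves production deployments?", "deploy-guide"),
]


def retrieve_chunks(query: str, top_k: int = 1) -> List[str]:
    """Rank chunks by the best-matching synthetic question for each chunk."""
    query_vec = embed(query)
    best: Dict[str, float] = {}
    for question, chunk_id in QUESTION_INDEX:
        score = cosine(query_vec, embed(question))
        best[chunk_id] = max(best.get(chunk_id, 0.0), score)
    ranked = sorted(best, key=lambda cid: best[cid], reverse=True)
    return ranked[:top_k]


print(retrieve_chunks("what port is the staging server on"))
```

Scoring a chunk by its best question, rather than by an average, is a design choice made for this sketch; the abstract does not say how scores are aggregated.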
Manh Luong, Khai Nguyen, Nhat Ho, Reza Haf, Dinh Phung, Lizhen Qu
The Learning-to-match (LTM) framework proves to be an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, facilitating subsequent matching. However, the conventional LTM framework faces scalability challenges, necessitating the use of the entire dataset each time the parameters of the ground metric are updated. In adapting LTM to the deep learning context, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems. This framework leverages mini-batch subsampling and a Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant using partial optimal transport to mitigate the harm of misaligned data pairs in training data. We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method is capable of learning a rich and expressive joint embedding space, which achieves SOTA performance. Beyond this, the proposed m-LTM framework is able to close the modality gap across audio and text embeddings, which surpasses both triplet and contrastive loss in the zero-shot sound event detection task on the ESC-50 dataset. Notably, our strategy of employing partial optimal transport with m-LTM demonstrates greater noise tolerance than contrastive loss, especially under varying noise ratios in training data on the AudioCaps dataset. Our code is available at https://github.com/v-manhlt3/m-LTM-Audio-Text-Retrieval
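To make the mini-batch matching idea concrete, the sketch below computes a transport plan between one small batch of audio and text embeddings under a Mahalanobis ground cost. Several things here are assumptions not stated in the abstract: entropic regularization solved with Sinkhorn iterations, uniform marginals, a fixed matrix `L` (so that the metric is `L^T L`), and hand-written 2-D embeddings. Learning the metric itself, which is the inverse optimal transport part, is omitted.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mahalanobis_cost(x: Vector, y: Vector, L: List[List[float]]) -> float:
    """Squared Mahalanobis distance ||L (x - y)||^2, i.e. metric M = L^T L."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    projected = [sum(row[k] * diff[k] for k in range(len(diff))) for row in L]
    return sum(p * p for p in projected)


def sinkhorn_plan(
    cost: List[List[float]], epsilon: float = 0.5, iterations: int = 200
) -> List[List[float]]:
    """Entropic OT plan with uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n
    b = [1.0 / m] * m
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical mini-batch: row i of each list is a paired audio/text item.
audio_batch = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
text_batch = [(0.1, 0.0), (1.0, 0.9), (2.0, 0.1)]
L = [[1.0, 0.0], [0.0, 1.0]]  # identity here; a learned matrix in practice

cost = [[mahalanobis_cost(x, y, L) for y in text_batch] for x in audio_batch]
plan = sinkhorn_plan(cost)
for row in plan:
    print([round(p, 4) for p in row])
```

With well-aligned pairs, most of the mass in each row sits on the diagonal, which is the behavior a matching loss would encourage.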
Sahel Sharifymoghaddam, Shivani Upadhyay, Wenhu Chen, Jimmy Lin
Recently, Large Vision Language Models (LVLMs) have unlocked many complex use
cases that require Multi-Modal (MM) understanding (e.g., image captioning or
visual question answering) and MM generation (e.g., text-guided image
generation or editing) capabilities. To further improve the output fidelity of
LVLMs, we introduce UniRAG, a plug-and-play technique that adds relevant
retrieved information to prompts as few-shot examples during inference. Unlike
the common belief that Retrieval Augmentation (RA) mainly improves generation
or understanding of uncommon entities, our evaluation results on the MSCOCO
dataset with common entities show that both proprietary models like GPT-4o and
Gemini-Pro and smaller open-source models like LLaVA, LaVIT, and Emu2
significantly enhance their generation quality when their input prompts are
augmented with relevant information retrieved by Vision-Language (VL)
retrievers like UniIR models. All the necessary code to reproduce our results
is available at https://github.com/castorini/UniRAG
Authors' comments: 14 pages, 6 figures
Hanzhuo Tan, Qi Luo, Ling Jiang, Zizheng Zhan, Jing Li, Haotian Zhang, Yuqun Zhang
Automated code completion, aiming at generating subsequent tokens from unfinished code, has benefited significantly from recent progress in pre-trained Large Language Models (LLMs). However, these models often suffer from coherence issues and hallucinations when dealing with complex code logic or extrapolating beyond their training data. Existing Retrieval Augmented Generation (RAG) techniques partially address these issues by retrieving relevant code with a separate encoding model where the retrieved snippet serves as contextual reference for code completion. However, their retrieval scope is subject to a singular perspective defined by the encoding model, which largely overlooks the complexity and diversity inherent in code semantics. To address this limitation, we propose ProCC, a code completion framework leveraging prompt engineering and the contextual multi-armed bandits algorithm to flexibly incorporate and adapt to multiple perspectives of code. ProCC first employs a prompt-based multi-retriever system which crafts prompt templates to elicit LLM knowledge to understand code semantics with multiple retrieval perspectives. Then, it adopts the adaptive retrieval selection algorithm to incorporate code similarity into the decision-making process to determine the most suitable retrieval perspective for the LLM to complete the code. Experimental results demonstrate that ProCC outperforms the state-of-the-art code completion technique by 8.6% on our collected open-source benchmark suite and 10.1% on the private-domain benchmark suite collected from a billion-user e-commerce company in terms of Exact Match. ProCC also allows augmenting fine-tuned techniques in a plug-and-play manner, yielding a 5.6% improvement over our studied fine-tuned model.
Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jaroslaw Szlichta
This paper demonstrates RAGE, an interactive tool for explaining Large
Language Models (LLMs) augmented with retrieval capabilities; i.e., able to
query external sources and pull relevant information into their input context.
Our explanations are counterfactual in the sense that they identify parts of
the input context that, when removed, change the answer to the question posed
to the LLM. RAGE includes pruning methods to navigate the vast space of
possible explanations, allowing users to view the provenance of the produced
answers.
Authors' comments: Accepted by ICDE 2024 (Demonstration Track)
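The counterfactual notion used in the RAGE entry above (parts of the context whose removal changes the answer) can be illustrated with a small search. In this sketch, `toy_answer` is a hypothetical stand-in for the retrieval-augmented LLM, and stopping at the smallest subset size that flips the answer is an assumed, very simple pruning rule; the abstract does not describe RAGE's actual pruning methods.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def toy_answer(passages: Sequence[str]) -> str:
    """Hypothetical stand-in for the LLM: majority vote over named candidates."""
    candidates = ["Paris", "Lyon"]
    counts = {c: sum(c in p for p in passages) for c in candidates}
    best = max(candidates, key=lambda c: counts[c])
    return best if counts[best] > 0 else "unknown"


def minimal_counterfactuals(
    passages: Sequence[str], answer_fn: Callable[[Sequence[str]], str]
) -> Tuple[str, List[Tuple[int, ...]]]:
    """Return the original answer and the smallest removal sets that change it."""
    original = answer_fn(passages)
    for size in range(1, len(passages) + 1):
        hits = []
        for removed in combinations(range(len(passages)), size):
            kept = [p for i, p in enumerate(passages) if i not in removed]
            if answer_fn(kept) != original:
                hits.append(removed)
        if hits:
            # Prune: larger removal sets are not explored once a size succeeds.
            return original, hits
    return original, []


context = [
    "Paris is the capital of France.",
    "The capital city, Paris, hosts the national parliament.",
    "Lyon is sometimes mistaken for the capital.",
    "France uses the euro.",
]

answer, explanations = minimal_counterfactuals(context, toy_answer)
print(answer, explanations)
```

Each returned tuple lists passage indices that, removed together, change the answer, so it doubles as a provenance trace for the original answer.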
Chris Samarinas, Hamed Zamani
The field of conversational information seeking, which is rapidly gaining interest in both academia and industry, is changing how we interact with search engines through natural language interactions. Existing datasets and methods mostly evaluate reactive conversational information seeking systems that solely provide a response to every query from the user. We identify a gap in building and evaluating proactive conversational information seeking systems that can monitor a multi-party human conversation and proactively engage in the conversation at an opportune moment by retrieving useful resources and suggestions. In this paper, we introduce a large-scale dataset for proactive document retrieval that consists of over 2.8 million conversations. We conduct crowdsourcing experiments to obtain high-quality and relatively complete relevance judgments through depth-k pooling. We also collect annotations identifying the parts of the conversation that are related to each document, enabling us to evaluate proactive retrieval systems. We introduce normalized proactive discounted cumulative gain (npDCG) for evaluating these systems, and further provide benchmark results for a wide range of models, including a novel model we developed for this task. We believe that the developed dataset, called ProCIS, paves the path towards developing proactive conversational information seeking systems.
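Depth-k pooling, mentioned above, is the standard way of choosing which documents to send to annotators: the top k results of every contributing system are merged into one judgment pool. A minimal sketch follows; the system names and rankings are hypothetical and not taken from ProCIS.

```python
from typing import Dict, List, Set


def depth_k_pool(runs: Dict[str, List[str]], k: int) -> Set[str]:
    """Union of the top-k document ids from every system's ranking."""
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool


# Hypothetical rankings from three retrieval systems for one conversation.
runs = {
    "bm25": ["d3", "d1", "d7", "d2"],
    "dense": ["d1", "d5", "d3", "d9"],
    "reranker": ["d5", "d3", "d8", "d1"],
}

pool = depth_k_pool(runs, k=2)
print(sorted(pool))
```

Documents outside the pool are typically treated as unjudged, which is why the abstract describes the resulting judgments as relatively complete rather than exhaustive.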
Mingzhu Wang, Yuzhe Zhang, Qihang Zhao, Juanyi Yang, Hong Zhang
Retrieval augmentation is critical when Language Models (LMs) exploit non-parametric knowledge related to the query through external knowledge bases before reasoning. The retrieved information is incorporated into LMs as context alongside the query, enhancing the reliability of responses to factual questions. Prior research in retrieval augmentation typically follows a retriever-generator paradigm. In this context, traditional retrievers encounter challenges in precisely and seamlessly extracting query-relevant information from knowledge bases. To address this issue, this paper introduces a novel retrieval augmentation framework called ChatLR that primarily employs the powerful semantic understanding ability of Large Language Models (LLMs) as retrievers to achieve precise and concise information retrieval. Additionally, we construct an LLM-based search and question answering system tailored for the financial domain by fine-tuning an LLM on two tasks, Text2API and API-ID recognition. Experimental results demonstrate the effectiveness of ChatLR in addressing user queries, achieving an overall information retrieval accuracy exceeding 98.8%.
Salam Albatarni, Sohaila Eltanbouly, Tamer Elsayed
Automated Essay Scoring automates the grading process of essays, providing a
great advantage for improving the writing proficiency of students. While
holistic essay scoring research is prevalent, a noticeable gap exists in
scoring essays for specific quality traits. In this work, we focus on the
relevance trait, which measures the ability of the student to stay on-topic
throughout the entire essay. We propose a novel approach for graded relevance
scoring of written essays that employs dense retrieval encoders. Dense
representations of essays at different relevance levels then form clusters in
the embedding space, such that their centroids are potentially separate enough
to effectively represent their relevance levels. We hence use the simple
1-Nearest-Neighbor classification over those centroids to determine the
relevance level of an unseen essay. As an effective unsupervised dense encoder,
we leverage Contriever, which is pre-trained with contrastive learning and
demonstrated comparable performance to supervised dense retrieval models. We
tested our approach on both task-specific (i.e., training and testing on the same
task) and cross-task (i.e., testing on an unseen task) scenarios using the widely
used ASAP++ dataset. Our method establishes a new state-of-the-art performance
in the task-specific scenario, while its extension for the cross-task scenario
exhibited a performance that is on par with the state-of-the-art model for that
scenario. We also analyzed the performance of our approach in a more practical
few-shot scenario, showing that it can significantly reduce the labeling cost
while sacrificing only 10% of its effectiveness.
Authors' comments: Accepted at SIGIR 2024
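The classification step in the essay-scoring entry above (one centroid per relevance level, then 1-Nearest-Neighbor over the centroids) is sketched below. The 2-D vectors are hypothetical placeholders for Contriever embeddings, and Euclidean distance is an assumption, since the abstract does not name the distance used.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def fit_centroids(embeddings_by_level: Dict[int, List[Vector]]) -> Dict[int, List[float]]:
    """One centroid per relevance level, computed from labeled essay embeddings."""
    return {level: centroid(vecs) for level, vecs in embeddings_by_level.items()}


def predict_level(embedding: Vector, centroids: Dict[int, List[float]]) -> int:
    """1-Nearest-Neighbor over centroids: the closest centroid's level wins."""
    return min(centroids, key=lambda level: math.dist(embedding, centroids[level]))


# Hypothetical embeddings of labeled essays, grouped by relevance level.
training = {
    0: [(0.0, 0.1), (0.2, 0.0)],
    1: [(1.0, 1.0), (1.2, 0.8)],
    2: [(2.0, 2.1), (2.2, 1.9)],
}

centroids = fit_centroids(training)
print(predict_level((1.0, 0.9), centroids))
```

Because only the centroids are stored, adding a few labeled essays per level is enough to update the classifier, which fits the few-shot scenario the abstract mentions.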
Peiqin Lin, André F. T. Martins, Hinrich Schütze
Recent studies indicate that leveraging off-the-shelf or fine-tuned
retrievers, capable of retrieving relevant in-context examples tailored to the
input query, enhances few-shot in-context learning of English. However,
adapting these methods to other languages, especially low-resource ones, poses
challenges due to the scarcity of cross-lingual retrievers and annotated data.
Thus, we introduce XAMPLER: Cross-Lingual Example Retrieval, a method tailored
to tackle the challenge of cross-lingual in-context learning using only
annotated English data. XAMPLER first trains a retriever based on Glot500, a
multilingual small language model, using positive and negative English examples
constructed from the predictions of a multilingual large language model, i.e.,
MaLA500. Leveraging the cross-lingual capacity of the retriever, it can
directly retrieve English examples as few-shot examples for in-context learning
of target languages. Experiments on two multilingual text classification
benchmarks, namely SIB200 with 176 languages and MasakhaNEWS with 16 languages,
demonstrate that XAMPLER substantially improves the in-context learning
performance across languages. Our code is available at
https://github.com/cisnlp/XAMPLER.
Authors' comments: NAACL 2025 Findings
Nhat Tran, Diane Litman
Knowledge retrieval is one of the major challenges in building a
knowledge-grounded dialogue system. A common method is to use a neural
retriever with a distributed approximate nearest-neighbor database to quickly
find the relevant knowledge sentences. In this work, we propose an approach
that utilizes topic modeling on the knowledge base to further improve retrieval
accuracy and as a result, improve response generation. Additionally, we
experiment with a large language model, ChatGPT, to take advantage of the
improved retrieval performance to further improve the generation results.
Experimental results on two datasets show that our approach can increase
retrieval and generation performance. The results also indicate that ChatGPT is
a better response generator for knowledge-grounded dialogue when relevant
knowledge is provided.
Authors' comments: LREC-COLING 2024