Sakuna Harinda Jayasundara, Nalin Asanka Gamagedara Arachchilage, Giovanni Russello
Manually generating access control policies from an organization's high-level
requirement specifications poses significant challenges. It requires laborious
efforts to sift through multiple documents containing such specifications and
translate their access requirements into access control policies. Also, the
complexities and ambiguities of these specifications often result in errors by
system administrators during the translation process, leading to data breaches.
However, the automated policy generation frameworks designed to help
administrators in this process are unreliable due to limitations, such as the
lack of domain adaptation. Therefore, to improve the reliability of access
control policy generation, we propose RAGent, a novel retrieval-based access
control policy generation framework based on language models. RAGent identifies
access requirements from high-level requirement specifications with an average
state-of-the-art F1 score of 87.9%. Through retrieval augmented generation,
RAGent then translates the identified access requirements into access control
policies with an F1 score of 77.9%. Unlike existing frameworks, RAGent
generates policies with complex components like purposes and conditions, in
addition to subjects, actions, and resources. Moreover, RAGent automatically
verifies the generated policies and iteratively refines them through a novel
verification-refinement mechanism, further improving the reliability of the
process by 3%, reaching an F1 score of 80.6%. We also introduce three
annotated datasets for developing access control policy generation frameworks
in the future, addressing the data scarcity of the domain.
Authors' comments: Submitted to Usenix 2025
Danilo Dordevic, Suryansh Kumar
We introduce the Evidential Transformer, an uncertainty-driven transformer
model for improved and robust image retrieval. In this paper, we make several
contributions to content-based image retrieval (CBIR). We incorporate
probabilistic methods into image retrieval, achieving robust and reliable
results, with evidential classification surpassing traditional training based
on multiclass classification as a baseline for deep metric learning.
Furthermore, we improve the state-of-the-art retrieval results on several
datasets by leveraging the Global Context Vision Transformer (GC ViT)
architecture. Our experimental results consistently demonstrate the reliability
of our approach, setting a new benchmark in CBIR in all test settings on the
Stanford Online Products (SOP) and CUB-200-2011 datasets.
Authors' comments: 6 pages, 6 figures, To be presented at the 3rd Workshop on
Uncertainty Quantification for Computer Vision, at the ECCV 2024 conference
in Milan, Italy
Benjamin L. Badger
Attention mechanisms that confer selective focus on a strict subset of input
elements are nearly ubiquitous in language models today. We posit that there is a
downside to the use of attention: most input information is lost. In support of
this idea we observe poor input representation accuracy in transformers and
more accurate representation in what we term masked mixers, which replace
self-attention with masked convolutions. The masked mixer learns causal
language modeling more efficiently than early transformer implementations and
even outperforms optimized, current transformers when training on small
($n_{ctx}<512$) but not larger context windows. Evidence is presented for the
hypothesis that differences in transformer and masked mixer training
efficiencies for various tasks are best predicted by input representation
accuracy, or equivalently global invertibility. We hypothesize that the
information loss exhibited by transformers would be more detrimental to
retrieval than generation, as the former is more closely approximated by a
bijective and thus invertible function. We find that masked mixers are more
effective retrieval models both when the pretrained embedding model is
unchanged as well as when the embedding model is modified via cosine
similarity-based InfoNCE loss minimization. A small masked mixer is shown to
outperform a large and near state-of-the-art transformer-based retrieval model,
despite the latter being trained with many orders of magnitude more data and
compute.
Authors' comments: 31 pages, 9 figures, 4 tables, 14 supplementary figures, 10
supplementary tables
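The cosine similarity-based InfoNCE objective mentioned in the abstract above can be illustrated with a minimal sketch. It assumes in-batch negatives (every other target in the batch is a negative for a given query) and a temperature of 0.1; both are illustrative choices, not details taken from the paper, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(queries, targets, temperature=0.1):
    """Mean InfoNCE loss; targets[i] is the positive for queries[i].

    All other targets in the batch act as negatives (an assumption).
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

# Queries aligned with their positives give a loss near zero;
# swapping the positives makes the loss large.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each query embedding toward its paired target and away from the other targets in the batch.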
Bhavik Chandna, Procheta Sen
Explainability has become a crucial concern in today's world, aiming to enhance transparency in machine learning and deep learning models. Information retrieval is no exception to this trend. In existing literature on explainability of information retrieval, the emphasis has predominantly been on illustrating the concept of relevance concerning a retrieval model. The questions addressed include why a document is relevant to a query, why one document exhibits higher relevance than another, or why a specific set of documents is deemed relevant for a query. However, limited attention has been given to understanding why a particular document is not favored (e.g. not within top-K) with respect to a query and a retrieval model. In an effort to address this gap, our work focuses on the question of what terms need to be added to a document to improve its ranking. This in turn answers the question of which missing words contributed to the document not being favored by a retrieval model for a particular query. We use a counterfactual framework to solve the above-mentioned research problem. To the best of our knowledge, this is the first attempt to tackle this specific counterfactual problem (i.e. examining the absence of which words can affect the ranking of a document). Our experiments show the effectiveness of our proposed approach in predicting counterfactuals for both statistical (e.g. BM25) and deep-learning-based models (e.g. DRMM, DSSM, ColBERT, MonoT5). The code implementation of our proposed approach is available at https://anonymous.4open.science/r/CfIR-v2.
Tim Raven, Arthur Matei, Gernot A. Fink
While methods based on Vision Transformers (ViT) have achieved state-of-the-art performance in many domains, they have not yet been applied successfully in the domain of writer retrieval. The field is dominated by methods using handcrafted features or features extracted from Convolutional Neural Networks. In this work, we bridge this gap and present a novel method that extracts features from a ViT and aggregates them using VLAD encoding. The model is trained in a self-supervised fashion without any need for labels. We show that extracting local foreground features is superior to using the ViT's class token in the context of writer retrieval. We evaluate our method on two historical document collections. We set a new state-of-the-art performance on the Historical-WI dataset (83.1% mAP) and the HisIR19 dataset (95.0% mAP). Additionally, we demonstrate that our ViT feature extractor can be directly applied to modern datasets such as the CVL database (98.6% mAP) without any fine-tuning.
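As a rough illustration of the VLAD aggregation step named in the abstract above, the sketch below assigns each local descriptor to its nearest centroid, accumulates the residuals per centroid, and L2-normalizes the concatenated result. The centroids are assumed to be given (they would typically come from k-means); the paper's exact variant, including any additional normalization or dimensionality reduction, is not stated here, so this is only the textbook form.

```python
import math

def vlad(descriptors, centroids):
    """Textbook VLAD encoding of local descriptors against fixed centroids."""
    k, dim = len(centroids), len(centroids[0])
    residuals = [[0.0] * dim for _ in range(k)]
    for x in descriptors:
        # Hard-assign the descriptor to its nearest centroid (squared Euclidean).
        nearest = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )
        # Accumulate the residual between descriptor and centroid.
        for i in range(dim):
            residuals[nearest][i] += x[i] - centroids[nearest][i]
    # Concatenate the per-centroid residual sums and L2-normalize.
    flat = [value for row in residuals for value in row]
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat

# Toy example: three 2-D descriptors, two centroids (all values hypothetical).
encoding = vlad(
    descriptors=[[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]],
    centroids=[[0.0, 0.0], [10.0, 10.0]],
)
```

The resulting vector has length `k * dim` regardless of how many local descriptors a page yields, which is what makes it usable as a fixed-size page descriptor for retrieval.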
Yuxia Wu, Lizi Liao, Yuan Fang
Modeling dynamic graphs, such as those found in social networks,
recommendation systems, and e-commerce platforms, is crucial for capturing
evolving relationships and delivering relevant insights over time. Traditional
approaches primarily rely on graph neural networks with temporal components or
sequence generation models, which often focus narrowly on the historical
context of target nodes. This limitation restricts the ability to adapt to new
and emerging patterns in dynamic graphs. To address this challenge, we propose
a novel framework, Retrieval-Augmented Generation for Dynamic Graph modeling
(RAG4DyG), which enhances dynamic graph predictions by incorporating
contextually and temporally relevant examples from broader graph structures.
Our approach includes a time- and context-aware contrastive learning module to
identify high-quality demonstrations and a graph fusion strategy to effectively
integrate these examples with historical contexts. The proposed framework is
designed to be effective in both transductive and inductive scenarios, ensuring
adaptability to previously unseen nodes and evolving graph structures.
Extensive experiments across multiple real-world datasets demonstrate the
effectiveness of RAG4DyG in improving predictive accuracy and adaptability for
dynamic graph modeling. The code and datasets are publicly available at
https://github.com/YuxiaWu/RAG4DyG.
Authors' comments: Accepted by SIGIR 2025
Richard Zanibbi, Behrooz Mansouri, Anurag Agarwal
Mathematical information is essential for technical work, but its creation,
interpretation, and search are challenging. To help address these challenges,
researchers have developed multimodal search engines and mathematical question
answering systems. This book begins with a simple framework characterizing the
information tasks that people and systems perform as we work to answer
math-related questions. The framework is used to organize and relate the other
core topics of the book, including interactions between people and systems,
representing math formulas in sources, and evaluation. We close by addressing
some key questions and presenting directions for future work. This book is
intended for students, instructors, and researchers interested in systems that
help us find and use mathematical information.
Authors' comments: [DRAFT] Revised (3rd) draft
Kavsar Huseynova, Jafar Isbarov
Document retrieval systems have experienced a revitalized interest with the
advent of retrieval-augmented generation (RAG). RAG architecture offers a lower
hallucination rate than LLM-only applications. However, the accuracy of the
retrieval mechanism is known to be a bottleneck in the efficiency of these
applications. A particular case of subpar retrieval performance is observed in
situations where multiple documents from several different but related topics
are in the corpus. We have devised a new vectorization method that takes into
account the topic information of the document. The paper introduces this new
method for text vectorization and evaluates it in the context of RAG.
Furthermore, we discuss the challenge of evaluating RAG systems, which pertains
to the case at hand.
Authors' comments: Accepted to AICT 2024
Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, Siliang Tang
Recently, Retrieval-Augmented Generation (RAG) has achieved remarkable
success in addressing the challenges of Large Language Models (LLMs) without
necessitating retraining. By referencing an external knowledge base, RAG
refines LLM outputs, effectively mitigating issues such as "hallucination",
lack of domain-specific knowledge, and outdated information. However, the
complex structure of relationships among different entities in databases
presents challenges for RAG systems. In response, GraphRAG leverages structural
information across entities to enable more precise and comprehensive retrieval,
capturing relational knowledge and facilitating more accurate, context-aware
responses. Given the novelty and potential of GraphRAG, a systematic review of
current technologies is imperative. This paper provides the first comprehensive
overview of GraphRAG methodologies. We formalize the GraphRAG workflow,
encompassing Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced
Generation. We then outline the core technologies and training methods at each
stage. Additionally, we examine downstream tasks, application domains,
evaluation methodologies, and industrial use cases of GraphRAG. Finally, we
explore future research directions to inspire further inquiries and advance
progress in the field.
Authors' comments: Ongoing work
Samhaa R. El-Beltagy, Mohamed A. Abdallah
Recently, Retrieval Augmented Generation (RAG) has emerged as a powerful technique in natural language processing, combining the strengths of retrieval-based and generation-based models to enhance text generation tasks. However, the application of RAG in Arabic, a language with unique characteristics and resource constraints, remains underexplored. This paper presents a comprehensive case study on the implementation and evaluation of RAG for Arabic text. The work focuses on exploring various semantic embedding models in the retrieval stage and several LLMs in the generation stage, in order to investigate what works and what doesn't in the context of Arabic. The work also touches upon the issue of variations between document dialect and query dialect in the retrieval stage. Results show that existing semantic embedding models and LLMs can be effectively employed to build Arabic RAG pipelines.
Nicholas Rossi, Juexin Lin, Feng Liu, Zhen Yang, Tony Lee, Alessandro Magnani, Ciya Liao
In embedding-based retrieval, Approximate Nearest Neighbor (ANN) search
enables efficient retrieval of similar items from large-scale datasets. While
maximizing recall of relevant items is usually the goal of retrieval systems, a
low precision may lead to a poor search experience. Unlike lexical retrieval,
which inherently limits the size of the retrieved set through keyword matching,
dense retrieval via ANN search has no natural cutoff. Moreover, the cosine
similarity scores of embedding vectors are often optimized via contrastive or
ranking losses, which make them difficult to interpret. Consequently, relying
on a top-K or cosine-similarity cutoff is often insufficient to filter out
irrelevant results effectively. This issue is prominent in product search,
where the number of relevant products is often small. This paper introduces a
novel relevance filtering component (called "Cosine Adapter") for
embedding-based retrieval to address this challenge. Our approach maps raw
cosine similarity scores to interpretable scores using a query-dependent
mapping function. We then apply a global threshold on the mapped scores to
filter out irrelevant results. We are able to significantly increase the
precision of the retrieved set, at the expense of a small loss of recall. The
effectiveness of our approach is demonstrated through experiments on both the
public MS MARCO dataset and internal Walmart product search data. Furthermore,
online A/B testing on the Walmart site validates the practical value of our
approach in real-world e-commerce settings.
Authors' comments: 8 pages, 3 figures, CIKM 2024
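The filtering idea in the abstract above (map raw cosine scores with a query-dependent function, then apply one global threshold) can be sketched as follows. The abstract does not specify the form of the mapping, so the logistic function and the per-query `scale` and `shift` parameters below are purely illustrative assumptions; in practice such parameters would be predicted from the query rather than supplied by hand.

```python
import math

def map_score(cos_sim, scale, shift):
    """Map a raw cosine similarity to (0, 1) with query-specific parameters.

    The logistic form is an assumption made for this sketch.
    """
    return 1.0 / (1.0 + math.exp(-scale * (cos_sim - shift)))

def filter_results(results, scale, shift, threshold=0.5):
    """Keep items whose mapped score clears a single global threshold."""
    return [item for item, cos_sim in results
            if map_score(cos_sim, scale, shift) >= threshold]

# Hypothetical ANN results as (item, raw cosine similarity) pairs.
results = [("item_a", 0.9), ("item_b", 0.7), ("item_c", 0.3)]

# The same raw score of 0.7 survives for one query and is dropped for another,
# because the mapping depends on the query while the threshold stays global.
broad_query_kept = filter_results(results, scale=20.0, shift=0.6)
narrow_query_kept = filter_results(results, scale=20.0, shift=0.8)
```

The design point is that the threshold is shared across queries while the mapping absorbs the per-query differences in how cosine scores are distributed.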
Hanjia Lyu, Hanqing Zeng, Yinglong Xia, Ren Chen, Jiebo Luo
Many existing industrial recommender systems are sensitive to the patterns of user-item engagement. Light users, who interact less frequently, correspond to a data sparsity problem, making it difficult for the system to accurately learn and represent their preferences. On the other hand, heavy users with rich interaction history often demonstrate a variety of niche interests that are hard to capture precisely under the standard "user-item" similarity measurement. Moreover, implementing these systems in an industrial environment necessitates that they are resource-efficient and scalable to process web-scale data under strict latency constraints. In this paper, we address these challenges by introducing an intermediate "interest" layer between users and items. We propose a novel approach that efficiently constructs user interest and facilitates low computational cost inference by clustering engagement graphs and incorporating user-interest attention. This method enhances the understanding of light users' preferences by linking them with heavy users. By integrating user-interest attention, our approach allows a more personalized similarity metric, adept at capturing the complex dynamics of user-item interactions. The use of interest as an intermediary layer fosters a balance between scalability and expressiveness in the model. Evaluations on two public datasets reveal that our method not only achieves improved recommendation performance but also demonstrates enhanced computational efficiency compared to item-level attention models. Our approach has also been deployed in multiple products at Meta, facilitating short-form video-related recommendations.
Hassan S. Shavarani, Anoop Sarkar
The similarity between the question and indexed documents is a crucial factor
in document retrieval for retrieval-augmented question answering. Although this
is typically the only method for obtaining the relevant documents, it is not
the sole approach when dealing with entity-centric questions. In this study, we
propose Entity Retrieval, a novel retrieval method that, rather than relying on
question-document similarity, depends on the salient entities within the
question to identify the retrieval documents. We conduct an in-depth analysis
of the performance of both dense and sparse retrieval methods in comparison to
Entity Retrieval. Our findings reveal that our method not only leads to more
accurate answers to entity-centric questions but also operates more
efficiently.
Authors' comments: 17 pages total, 10 Tables, 4 Figures
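To make the contrast with similarity-based retrieval concrete, here is a minimal sketch of looking up documents by the entities mentioned in a question. It assumes each document is keyed by an entity name and uses naive substring matching to spot entities; a real system would use a proper entity linker, and all names and data below are hypothetical.

```python
def build_entity_index(documents):
    """Map a lower-cased entity name (here, the document title) to its text."""
    return {title.lower(): text for title, text in documents.items()}

def retrieve_by_entity(question, entity_index):
    """Return documents whose entity name appears in the question.

    Substring matching stands in for entity linking (an assumption).
    """
    lowered = question.lower()
    return [text for name, text in entity_index.items() if name in lowered]

# Hypothetical two-document corpus keyed by entity name.
index = build_entity_index({
    "Marie Curie": "Marie Curie was born in Warsaw in 1867.",
    "Paris": "Paris is the capital of France.",
})
retrieved = retrieve_by_entity("Where was Marie Curie born?", index)
```

No question or document embeddings are computed here, which is the intuition behind the efficiency claim: retrieval reduces to an index lookup once the entities are identified.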
Arian Askari, Chuan Meng, Mohammad Aliannejadi, Zhaochun Ren, Evangelos Kanoulas, Suzan Verberne
Existing generative retrieval (GR) approaches rely on training-based indexing, i.e., fine-tuning a model to memorise the associations between a query and the document identifier (docid) of a relevant document. Training-based indexing has three limitations: high training overhead, under-utilization of the pre-trained knowledge of large language models (LLMs), and challenges in adapting to a dynamic document corpus. To address the above issues, we propose a novel few-shot indexing-based GR framework (Few-Shot GR). It has a novel few-shot indexing process, where we prompt an LLM to generate docids for all documents in a corpus, ultimately creating a docid bank for the entire corpus. During retrieval, we feed a query to the same LLM and constrain it to generate a docid within the docid bank created during indexing, and then map the generated docid back to its corresponding document. Few-Shot GR relies solely on prompting an LLM without requiring any training, making it more efficient. Moreover, we devise few-shot indexing with one-to-many mapping to further enhance Few-Shot GR. Experiments show that Few-Shot GR achieves superior performance to state-of-the-art GR methods that require heavy training.
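The retrieval step described above, constraining generation to docids that exist in the docid bank, can be sketched with a toy greedy decoder. The `score_fn` argument stands in for the LLM's next-token scores and is a stub here; the tokenized docids, the greedy search, and the assumption that no docid is a prefix of another are all illustrative choices, not details from the paper.

```python
def allowed_next_tokens(prefix, docid_bank):
    """Tokens that extend `prefix` toward at least one docid in the bank."""
    n = len(prefix)
    return {docid[n] for docid in docid_bank
            if len(docid) > n and docid[:n] == prefix}

def constrained_decode(score_fn, docid_bank):
    """Greedy decoding restricted to the docid bank.

    Assumes no docid is a prefix of another, so decoding stops
    as soon as the prefix equals a complete docid.
    """
    prefix = ()
    while prefix not in docid_bank:
        candidates = allowed_next_tokens(prefix, docid_bank)
        best = max(candidates, key=lambda token: score_fn(prefix, token))
        prefix += (best,)
    return prefix

# Hypothetical docid bank: each docid is a tuple of tokens mapped to a document.
docid_to_doc = {
    ("solar", "energy"): "doc-1",
    ("solar", "system"): "doc-2",
    ("wind", "power"): "doc-3",
}
docid_bank = set(docid_to_doc)

# Stub scores standing in for an LLM conditioned on the query.
preferences = {"solar": 0.9, "wind": 0.1, "energy": 0.2, "system": 0.8,
               "power": 0.5, "moon": 1.0}

def stub_score(prefix, token):
    return preferences[token]

docid = constrained_decode(stub_score, docid_bank)
document = docid_to_doc[docid]
```

Even though the stub scores "moon" highest, it is never emitted because it does not extend any docid in the bank, so every decoded sequence maps back to a real document.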
Xi Wang, Procheta Sen, Ruizhe Li, Emine Yilmaz
Despite the success of integrating large language models into the development
of conversational systems, many studies have shown the effectiveness of
retrieving and augmenting external knowledge for informative responses. Hence,
many existing studies commonly assume that Retrieval Augmented
Generation (RAG) is always needed in a conversational system, without explicit control. This
raises a research question about such a necessity. In this study, we propose to
investigate the need for each turn of system response to be augmented with
external knowledge. In particular, by leveraging human judgements on the binary
choice of adaptive augmentation, we develop RAGate, a gating model, which
models conversation context and relevant inputs to predict if a conversational
system requires RAG for improved responses. We conduct extensive experiments on
devising and applying RAGate to conversational models and well-rounded analyses
of different conversational scenarios. Our experimental results and analysis
indicate the effective application of RAGate in RAG-based conversational
systems in identifying system responses for appropriate RAG with high-quality
responses and a high generation confidence. This study also identifies the
correlation between the generation's confidence level and the relevance of the
augmented knowledge.
Authors' comments: 12 pages, under review
To Eun Kim, Alireza Salemi, Andrew Drozdov, Fernando Diaz, Hamed Zamani
In the field of language modeling, models augmented with retrieval components have emerged as a promising solution to address several challenges faced in the natural language processing (NLP) field, including knowledge grounding, interpretability, and scalability. Despite the primary focus on NLP, we posit that the paradigm of retrieval-enhancement can be extended to a broader spectrum of machine learning (ML) such as computer vision, time series prediction, and computational biology. Therefore, this work introduces a formal framework of this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature in various domains in ML with a consistent notation, which is missing from the current literature. Also, we found that while a number of studies employ retrieval components to augment their models, there is a lack of integration with foundational Information Retrieval (IR) research. We bridge this gap between the seminal IR research and contemporary REML studies by investigating each component that comprises the REML framework. Ultimately, the goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
Mohammad Aliannejadi, Jacek Gwizdka, Hamed Zamani
At its core, information access and seeking is an interactive process. In
existing search engines, interactions are limited to a few pre-defined actions,
such as "requery", "click on a document", "scrolling up/down", "going to the
next result page", "leaving the search engine", etc. A major benefit of moving
towards generative IR systems is enabling users with a richer expression of
information need and feedback and free-form interactions in natural language
and beyond. In other words, the actions users take are no longer limited by the
clickable links and buttons available on the search engine result page and
users can express themselves freely through natural language. This can go even
beyond natural language, through images, videos, gestures, and sensors using
multi-modal generative IR systems. This chapter briefly discusses the role of
interaction in generative IR systems. We will first discuss different ways
users can express their information needs by interacting with generative IR
systems. We then explain how users can provide explicit or implicit feedback to
generative IR systems and how they can consume such feedback. Next, we will
cover how users can interactively refine retrieval results. We will expand upon
mixed-initiative interactions and discuss clarification and preference
elicitation in more detail. We then discuss proactive generative IR systems,
including context-aware recommendation, following up past conversations,
contributing to multi-party conversations, and feedback requests. Providing
explanation is another interaction type that we briefly discuss in this
chapter. We will also briefly describe multi-modal interactions in generative
information retrieval. Finally, we describe emerging frameworks and solutions
for user interfaces with generative AI systems.
Authors' comments: Draft of a chapter intended to appear in a forthcoming book on
generative information retrieval, co-edited by Chirag Shah and Ryen White
Yuxuan Wu, Xiao Yi, Yang Tan, Huiqun Yu, Guisheng Fan, Gaowei Zheng
Protein retrieval, which aims to disentangle the relationship
between sequences, structures, and functions, advances biological research.
The Basic Local Alignment Search Tool (BLAST), a sequence-similarity-based
algorithm, has proven effective in this field. However, existing tools
for protein retrieval prioritize sequence similarity and may
overlook proteins that are dissimilar but share homology or functionality. In
order to tackle this problem, we propose a novel protein retrieval framework
that mitigates the bias towards sequence similarity. Our framework initiatively
harnesses protein language models (PLMs) to embed protein sequences within a
high-dimensional feature space, thereby enhancing the representation capacity
for subsequent analysis. Subsequently, an accelerated indexed vector database
is constructed to facilitate expedited access and retrieval of dense vectors.
Extensive experiments demonstrate that our framework can equally retrieve both
similar and dissimilar proteins. Moreover, this approach enables the
identification of proteins that conventional methods fail to uncover. This
framework will effectively assist in protein mining and empower the development
of biology.
Authors' comments: 16 pages, 12 figures
Alex Oesterling, Claudio Mayrink Verdun, Carol Xuan Long, Alexander Glynn, Lucas Monteiro Paes, Sajani Vithana, Martina Cardone, Flavio P. Calmon
Image search and retrieval tasks can perpetuate harmful stereotypes, erase
cultural identities, and amplify social disparities. Current approaches to
mitigate these representational harms balance the number of retrieved items
across population groups defined by a small number of (often binary)
attributes. However, most existing methods overlook intersectional groups
determined by combinations of group attributes, such as gender, race, and
ethnicity. We introduce Multi-Group Proportional Representation (MPR), a novel
metric that measures representation across intersectional groups. We develop
practical methods for estimating MPR, provide theoretical guarantees, and
propose optimization algorithms to ensure MPR in retrieval. We demonstrate that
existing methods optimizing for equal and proportional representation metrics
may fail to promote MPR. Crucially, our work shows that optimizing MPR yields
more proportional representation across multiple intersectional groups
specified by a rich function class, often with minimal compromise in retrieval
accuracy.
Authors' comments: 48 pages, 33 figures. Accepted as poster at NeurIPS 2024. Code can be
found at
https://github.com/alex-oesterling/multigroup-proportional-representation
Zongyang Ma, Ziqi Zhang, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Yingmin Luo, Xu Li et al.
Understanding the content of events occurring in the video and their inherent
temporal logic is crucial for video-text retrieval. However, web-crawled
pre-training datasets often lack sufficient event information, and the widely
adopted video-level cross-modal contrastive learning also struggles to capture
detailed and complex video-text event alignment. To address these challenges,
we make improvements from both data and model perspectives. In terms of
pre-training data, we focus on supplementing the missing specific event content
and event temporal transitions with the proposed event augmentation strategies.
Based on the event-augmented data, we construct a novel Event-Aware Video-Text
Retrieval model, i.e., EA-VTR, which achieves powerful video-text retrieval
ability through superior video event awareness. EA-VTR can efficiently encode
frame-level and video-level visual representations simultaneously, enabling
detailed event content and complex event temporal cross-modal alignment,
ultimately enhancing the comprehensive understanding of video events. Our
method not only significantly outperforms existing approaches on multiple
datasets for Text-to-Video Retrieval and Video Action Recognition tasks, but
also demonstrates superior event content perception ability on Multi-event
Video-Text Retrieval and Video Moment Retrieval tasks, as well as outstanding
event temporal logic understanding ability on the Test of Time task.
Authors' comments: Accepted by ECCV 2024