Alex Buna, Patrick Rebeschini
Recent progress in robust statistical learning has mainly tackled convex problems, like mean estimation or linear regression, with non-convex challenges receiving less attention. Phase retrieval exemplifies such a non-convex problem, requiring the recovery of a signal from only the magnitudes of its linear measurements, without phase (sign) information. While several non-convex methods, especially those involving the Wirtinger Flow algorithm, have been proposed for noiseless or mild noise settings, developing solutions for heavy-tailed noise and adversarial corruption remains an open challenge. In this paper, we investigate an approach that leverages robust gradient descent techniques to improve the Wirtinger Flow algorithm's ability to simultaneously cope with fourth moment bounded noise and adversarial contamination in both the inputs (covariates) and outputs (responses). We address two scenarios: known zero-mean noise and completely unknown noise. For the latter, we propose a preprocessing step that alters the problem into a new format that does not fit traditional phase retrieval approaches but can still be resolved with a tailored version of the algorithm for the zero-mean noise context.
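As a rough illustration of the idea in the abstract above, the sketch below takes one gradient step on the usual squared-intensity loss for real-valued phase retrieval, replacing the plain average of per-sample gradients with a robust aggregate. The abstract does not say which robust estimator the authors use; coordinate-wise median-of-means is a stand-in chosen here for illustration, and all function names, the step size, and the block count are hypothetical. Initialization is not covered.

```python
from statistics import median

def sample_gradient(a, y, x):
    """Gradient of 0.25 * ((a . x)^2 - y)^2 with respect to x for one measurement."""
    inner = sum(ai * xi for ai, xi in zip(a, x))
    residual = inner * inner - y
    return [residual * inner * ai for ai in a]

def median_of_means(vectors, num_blocks):
    """Coordinate-wise median of block means (assumed stand-in robust estimator)."""
    blocks = [vectors[i::num_blocks] for i in range(num_blocks)]
    blocks = [b for b in blocks if b]  # drop empty blocks when there are few samples
    dim = len(vectors[0])
    block_means = [[sum(v[j] for v in b) / len(b) for j in range(dim)] for b in blocks]
    return [median(bm[j] for bm in block_means) for j in range(dim)]

def robust_wf_step(A, ys, x, step, num_blocks):
    """One descent step using the robust aggregate instead of the empirical mean."""
    grads = [sample_gradient(a, y, x) for a, y in zip(A, ys)]
    g = median_of_means(grads, num_blocks)
    return [xi - step * gi for xi, gi in zip(x, g)]
```

With a plain mean, a single corrupted measurement can dominate the update; with median-of-means, it affects only the block it falls in, which the median then discards as long as most blocks are clean.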
Jian Zhu, Mingkai Sheng, Zhangmin Huang, Jingfei Chang, Jinling Jiang, Jian Long, Cheng Luo, Lei Liu
Multi-modal hashing methods, which fuse multi-source data to generate binary
hash codes, are widely used in multimedia retrieval. However, the
individual backbone networks have limited feature expression capabilities and
are not jointly pre-trained on large-scale unsupervised multi-modal data,
resulting in low retrieval accuracy. To address this issue, we propose a novel
CLIP Multi-modal Hashing (CLIPMH) method. Our method employs the CLIP framework
to extract both text and vision features and then fuses them to generate hash
code. Because each modality's features are enhanced, our method substantially
improves the retrieval performance of multi-modal hashing.
Compared with state-of-the-art unsupervised and supervised multi-modal hashing
methods, experiments reveal that the proposed CLIPMH can significantly improve
performance (a maximum increase of 8.38% in mAP).
Authors' comments: Accepted by 31st International Conference on MultiMedia Modeling
(MMM2025)
Rong Liu, Yongming Qu
The International Council for Harmonisation of Technical Requirements for
Pharmaceuticals for Human Use (ICH) E9 (R1) Addendum provides a framework for
defining estimands in clinical trials. The treatment policy strategy is the most
commonly used approach for handling intercurrent events in defining estimands. Imputing
missing values for potential outcomes under the treatment policy strategy has
been discussed in the literature. Missing values as a result of administrative
study withdrawals (such as site closures due to business reasons, COVID-19
control measures, and geopolitical conflicts, etc.) are often imputed in the
same way as other missing values occurring after intercurrent events related to
safety or efficacy. Some research suggests using a hypothetical strategy to
handle the treatment discontinuations due to administrative study withdrawal in
defining the estimands and imputing the missing values based on completer data
assuming missing at random, but this approach ignores the fact that subjects
might experience other intercurrent events had they not had the administrative
study withdrawal. In this article, we treat administrative study withdrawal as
censoring the normal, real-world-like intercurrent events and propose
two methods for handling the corresponding missing values under the retrieved
dropout imputation framework. Simulation shows the two methods perform well. We
also applied the methods to actual clinical trial data evaluating an
anti-diabetes treatment.
Authors' comments: 16 pages, 5 tables, and 2 figures
Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, Chao Huang
Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user needs. However, existing RAG systems have significant limitations, including reliance on flat data representations and inadequate contextual awareness, which can lead to fragmented answers that fail to capture complex inter-dependencies. To address these challenges, we propose LightRAG, which incorporates graph structures into text indexing and retrieval processes. This innovative framework employs a dual-level retrieval system that enhances comprehensive information retrieval from both low-level and high-level knowledge discovery. Additionally, the integration of graph structures with vector representations facilitates efficient retrieval of related entities and their relationships, significantly improving response times while maintaining contextual relevance. This capability is further enhanced by an incremental update algorithm that ensures the timely integration of new data, allowing the system to remain effective and responsive in rapidly changing data environments. Extensive experimental validation demonstrates considerable improvements in retrieval accuracy and efficiency compared to existing approaches. We have made our LightRAG open-source and available at the link: https://github.com/HKUDS/LightRAG
Yijie Ding, Yupeng Hou, Jiacheng Li, Julian McAuley
Generative recommendation (GR) is an emerging paradigm that tokenizes items into discrete tokens and learns to autoregressively generate the next tokens as predictions. Although effective, GR models operate in a transductive setting, meaning they can only generate items seen during training without applying heuristic re-ranking strategies. In this paper, we propose SpecGR, a plug-and-play framework that enables GR models to recommend new items in an inductive setting. SpecGR uses a drafter model with inductive capability to propose candidate items, which may include both existing items and new items. The GR model then acts as a verifier, accepting or rejecting candidates while retaining its strong ranking capabilities. We further introduce the guided re-drafting technique to make the proposed candidates more aligned with the outputs of generative recommendation models, improving the verification efficiency. We consider two variants for drafting: (1) using an auxiliary drafter model for better flexibility, or (2) leveraging the GR model's own encoder for parameter-efficient self-drafting. Extensive experiments on three real-world datasets demonstrate that SpecGR exhibits both strong inductive recommendation ability and the best overall performance among the compared methods. Our code is available at: https://github.com/Jamesding000/SpecGR.
Yotam Intrator, Ori Kelner, Regev Cohen, Roman Goldenberg, Ehud Rivlin, Daniel Freedman
Information retrieval (IR) methods, like retrieval augmented generation, are
fundamental to modern applications but often lack statistical guarantees.
Conformal prediction addresses this by retrieving sets guaranteed to include
relevant information, yet existing approaches produce large-sized sets,
incurring high computational costs and slow response times. In this work, we
introduce a score refinement method that applies a simple monotone
transformation to retrieval scores, leading to significantly smaller conformal
sets while maintaining their statistical guarantees. Experiments on various
BEIR benchmarks validate the effectiveness of our approach in producing compact
sets containing relevant information.
Authors' comments: 6 pages
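For context on the conformal retrieval sets mentioned in the abstract above, here is a minimal split-conformal sketch: a threshold is calibrated on queries whose relevant document is known, and at query time every document scoring at or above it is returned. This is the generic baseline only. The abstract does not specify the authors' monotone score transformation, so it is not reproduced here, and the function names and data layout are assumptions.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Threshold from the relevant-document scores of n calibration queries.

    Under exchangeability, the relevant document of a new query scores at or
    above this threshold with probability at least 1 - alpha.
    """
    n = len(calibration_scores)
    k = math.floor(alpha * (n + 1))
    if k < 1:
        # Too few calibration points for this alpha: keep every document.
        return float("-inf")
    return sorted(calibration_scores)[k - 1]

def conformal_set(doc_scores, threshold):
    """All documents whose retrieval score reaches the calibrated threshold."""
    return [doc for doc, score in doc_scores.items() if score >= threshold]
```

The size of the returned set depends on how many non-relevant documents score above the threshold, which is why reshaping the scores can shrink the sets.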
Chao-Wei Huang, Yun-Nung Chen
Effective information retrieval (IR) from vast datasets relies on advanced
techniques to extract relevant information in response to queries. Recent
advancements in dense retrieval have showcased remarkable efficacy compared to
traditional sparse retrieval methods. To further enhance retrieval performance,
knowledge distillation techniques, often leveraging robust cross-encoder
rerankers, have been extensively explored. However, existing approaches
primarily distill knowledge from pointwise rerankers, which assign absolute
relevance scores to documents, thus facing challenges related to inconsistent
comparisons. This paper introduces Pairwise Relevance Distillation
(PairDistill) to leverage pairwise reranking, offering fine-grained
distinctions between similarly relevant documents to enrich the training of
dense retrieval models. Our experiments demonstrate that PairDistill
outperforms existing methods, achieving new state-of-the-art results across
multiple benchmarks. This highlights the potential of PairDistill in advancing
dense retrieval techniques effectively. Our source code and trained models are
released at https://github.com/MiuLab/PairDistill
Authors' comments: Accepted to EMNLP 2024 Main Conference
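To make the pointwise-versus-pairwise distinction in the abstract above concrete, the sketch below distills pairwise teacher preferences into a student retriever. The student's preference for document i over j is taken as a sigmoid of the score difference and matched to the teacher's probability with a cross-entropy loss. The exact PairDistill objective is not given in the abstract, so this loss, the function names, and the data layout are illustrative assumptions.

```python
import math

def pairwise_prob(score_i, score_j):
    """Student probability that document i is more relevant than document j."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

def pairwise_distill_loss(student_scores, teacher_prefs):
    """Mean cross-entropy between teacher and student pairwise preferences.

    student_scores: dict mapping doc id -> student relevance score.
    teacher_prefs: dict mapping (i, j) -> teacher probability that i beats j.
    """
    total = 0.0
    for (i, j), p in teacher_prefs.items():
        q = pairwise_prob(student_scores[i], student_scores[j])
        total += -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))
    return total / len(teacher_prefs)
```

Because the loss depends only on score differences, it can separate two documents that a pointwise teacher would give nearly identical absolute scores.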
Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Xueqi Cheng
Generative retrieval represents a novel approach to information retrieval. It
uses an encoder-decoder architecture to directly produce relevant document
identifiers (docids) for queries. While this method offers benefits, current
approaches are limited to scenarios with binary relevance data, overlooking the
potential for documents to have multi-graded relevance. Extending generative
retrieval to accommodate multi-graded relevance poses challenges, including the
need to reconcile likelihood probabilities for docid pairs and the possibility
of multiple relevant documents sharing the same identifier. To address these
challenges, we introduce a framework called GRaded Generative Retrieval
(GR$^2$). GR$^2$ focuses on two key components: ensuring relevant and distinct
identifiers, and implementing multi-graded constrained contrastive training.
First, we create identifiers that are both semantically relevant and
sufficiently distinct to represent individual documents effectively. This is
achieved by jointly optimizing the relevance and distinctness of docids through
a combination of docid generation and autoencoder models. Second, we
incorporate information about the relationship between relevance grades to
guide the training process. We use a constrained contrastive training strategy
to bring the representations of queries and the identifiers of their relevant
documents closer together, based on their respective relevance grades.
Extensive experiments on datasets with both multi-graded and binary relevance
demonstrate the effectiveness of GR$^2$.
Authors' comments: Accepted by the NeurIPS 2024 (Spotlight)
Hung-Ting Chen, Eunsol Choi
We study retrieving a set of documents that covers various perspectives on a complex and contentious question (e.g., will ChatGPT do more harm than good?). We curate a Benchmark for Retrieval Diversity for Subjective questions (BERDS), where each example consists of a question and diverse perspectives associated with the question, sourced from survey questions and debate websites. On this data, retrievers paired with a corpus are evaluated to surface a document set that contains diverse perspectives. Our framing diverges from most retrieval tasks in that document relevancy cannot be decided by simple string matches to references. Instead, we build a language model-based automatic evaluator that decides whether each retrieved document contains a perspective. This allows us to evaluate the performance of three different types of corpus (Wikipedia, web snapshot, and corpus constructed on the fly with retrieved pages from the search engine) paired with retrievers. Retrieving diverse documents remains challenging, with the outputs from existing retrievers covering all perspectives on only 40% of the examples. We further study the effectiveness of query expansion and diversity-focused reranking approaches and analyze retriever sycophancy.
Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar, Jaydeep Sen
Despite significant progress in multilingual information retrieval, the lack of models capable of effectively supporting multiple languages, particularly low-resource ones such as Indic languages, remains a critical challenge. This paper presents NLLB-E5: A Scalable Multilingual Retrieval Model. NLLB-E5 leverages the in-built multilingual capabilities in the NLLB encoder for translation tasks. It proposes a distillation approach from the multilingual retriever E5 to provide a zero-shot retrieval approach handling multiple languages, including all major Indic languages, without requiring multilingual training data. We evaluate the model on a comprehensive suite of existing benchmarks, including Hindi-BEIR, highlighting its robust performance across diverse languages and tasks. Our findings uncover task and domain-specific challenges, providing valuable insights into the retrieval performance, especially for low-resource languages. NLLB-E5 addresses the urgent need for an inclusive, scalable, and language-agnostic text retrieval model, advancing the field of multilingual information access and promoting digital inclusivity for millions of users globally.
Jüri Keller, Timo Breuer, Philipp Schaer
Information Retrieval (IR) systems are exposed to constant changes in most
components. Documents are created, updated, or deleted, the information needs
are changing, and even relevance might not be static. While it is generally
expected that the IR systems retain a consistent utility for the users, test
collection evaluations rely on a fixed experimental setup. Based on the
LongEval shared task and test collection, this work explores how the
effectiveness measured in evolving experiments can be assessed. Specifically,
the persistency of effectiveness is investigated as a replicability task. It is
observed how the effectiveness progressively deteriorates over time compared to
the initial measurement. Employing adapted replicability measures provides
further insight into the persistence of effectiveness. The ranking of systems
varies across retrieval measures and time. In conclusion, it was found that the
most effective systems are not necessarily the ones with the most persistent
performance.
Authors' comments: Experimental IR Meets Multilinguality, Multimodality, and Interaction
- 15th International Conference of the CLEF Association, CLEF 2024, Grenoble,
France, September 9-12, 2024, Proceedings. arXiv admin note: text overlap
with arXiv:2308.10549
Tuba Gokhan, Kexin Wang, Iryna Gurevych, Ted Briscoe
Regulatory documents, issued by governmental regulatory bodies, establish rules, guidelines, and standards that organizations must adhere to for legal compliance. These documents, characterized by their length, complexity and frequent updates, are challenging to interpret, requiring significant allocation of time and expertise on the part of organizations to ensure ongoing compliance. Regulatory Natural Language Processing (RegNLP) is a multidisciplinary field aimed at simplifying access to and interpretation of regulatory rules and obligations. We introduce a task of generating question-passage pairs, where questions are automatically created and paired with relevant regulatory passages, facilitating the development of regulatory question-answering systems. We create the ObliQA dataset, containing 27,869 questions derived from the collection of Abu Dhabi Global Markets (ADGM) financial regulation documents, design a baseline Regulatory Information Retrieval and Answer Generation (RIRAG) system and evaluate it with RePASs, a novel evaluation metric that tests whether generated answers accurately capture all relevant obligations while avoiding contradictions.
Sakuna Harinda Jayasundara, Nalin Asanka Gamagedara Arachchilage, Giovanni Russello
Manually generating access control policies from an organization's high-level
requirement specifications poses significant challenges. It requires laborious
efforts to sift through multiple documents containing such specifications and
translate their access requirements into access control policies. Also, the
complexities and ambiguities of these specifications often result in errors by
system administrators during the translation process, leading to data breaches.
However, the automated policy generation frameworks designed to help
administrators in this process are unreliable due to limitations, such as the
lack of domain adaptation. Therefore, to improve the reliability of access
control policy generation, we propose RAGent, a novel retrieval-based access
control policy generation framework based on language models. RAGent identifies
access requirements from high-level requirement specifications with an average
state-of-the-art F1 score of 87.9%. Through retrieval augmented generation,
RAGent then translates the identified access requirements into access control
policies with an F1 score of 77.9%. Unlike existing frameworks, RAGent
generates policies with complex components like purposes and conditions, in
addition to subjects, actions, and resources. Moreover, RAGent automatically
verifies the generated policies and iteratively refines them through a novel
verification-refinement mechanism, further improving the reliability of the
process by 3%, reaching an F1 score of 80.6%. We also introduce three
annotated datasets for developing access control policy generation frameworks
in the future, addressing the data scarcity of the domain.
Authors' comments: Submitted to Usenix 2025
Danilo Dordevic, Suryansh Kumar
We introduce the Evidential Transformer, an uncertainty-driven transformer
model for improved and robust image retrieval. In this paper, we make several
contributions to content-based image retrieval (CBIR). We incorporate
probabilistic methods into image retrieval, achieving robust and reliable
results, with evidential classification surpassing traditional training based
on multiclass classification as a baseline for deep metric learning.
Furthermore, we improve the state-of-the-art retrieval results on several
datasets by leveraging the Global Context Vision Transformer (GC ViT)
architecture. Our experimental results consistently demonstrate the reliability
of our approach, setting a new benchmark in CBIR in all test settings on the
Stanford Online Products (SOP) and CUB-200-2011 datasets.
Authors' comments: 6 pages, 6 figures, To be presented at the 3rd Workshop on
Uncertainty Quantification for Computer Vision, at the ECCV 2024 conference
in Milan, Italy
Benjamin L. Badger
Attention mechanisms that confer selective focus on a strict subset of input
elements are nearly ubiquitous in language models today. We posit that there is a
downside to the use of attention: most input information is lost. In support of
this idea we observe poor input representation accuracy in transformers and
more accurate representation in what we term masked mixers, which replace
self-attention with masked convolutions. The masked mixer learns causal
language modeling more efficiently than early transformer implementations and
even outperforms optimized, current transformers when training on small
($n_{ctx}<512$) but not larger context windows. Evidence is presented for the
hypothesis that differences in transformer and masked mixer training
efficiencies for various tasks are best predicted by input representation
accuracy, or equivalently global invertibility. We hypothesize that the
information loss exhibited by transformers would be more detrimental to
retrieval than generation, as the former is more closely approximated by a
bijective and thus invertible function. We find that masked mixers are more
effective retrieval models both when the pretrained embedding model is
unchanged and when the embedding model is modified via cosine
similarity-based InfoNCE loss minimization. A small masked mixer is shown to
outperform a large and near state-of-the-art transformer-based retrieval model,
despite the latter being trained with many orders of magnitude more data and
compute.
Authors' comments: 31 pages, 9 figures, 4 tables, 14 supplementary figures, 10
supplementary tables
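The cosine similarity-based InfoNCE loss named in the masked mixer abstract above can be sketched as follows for a single query with one positive and several negatives. The temperature value and the function names are assumptions made for illustration, and all vectors are assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.1):
    """Negative log-softmax of the positive among positive and negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]
```

Minimizing this loss pulls the query embedding toward its matching passage and away from the negatives, which is the usual way an embedding model is adapted for retrieval.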
Bhavik Chandna, Procheta Sen
Explainability has become a crucial concern in today's world, aiming to enhance transparency in machine learning and deep learning models. Information retrieval is no exception to this trend. In existing literature on explainability of information retrieval, the emphasis has predominantly been on illustrating the concept of relevance concerning a retrieval model. The questions addressed include why a document is relevant to a query, why one document exhibits higher relevance than another, or why a specific set of documents is deemed relevant for a query. However, limited attention has been given to understanding why a particular document is not favored (e.g. not within top-K) with respect to a query and a retrieval model. In an effort to address this gap, our work focuses on the question of what terms need to be added within a document to improve its ranking. This in turn answers the question of which words played a role in the document not being favored by a retrieval model for a particular query. We use a counterfactual framework to solve the above-mentioned research problem. To the best of our knowledge, this is the first attempt to tackle this specific counterfactual problem (i.e. examining the absence of which words can affect the ranking of a document). Our experiments show the effectiveness of our proposed approach in predicting counterfactuals for both statistical (e.g. BM25) and deep-learning-based models (e.g. DRMM, DSSM, ColBERT, MonoT5). The code implementation of our proposed approach is available at https://anonymous.4open.science/r/CfIR-v2.
Tim Raven, Arthur Matei, Gernot A. Fink
While methods based on Vision Transformers (ViT) have achieved state-of-the-art performance in many domains, they have not yet been applied successfully in the domain of writer retrieval. The field is dominated by methods using handcrafted features or features extracted from Convolutional Neural Networks. In this work, we bridge this gap and present a novel method that extracts features from a ViT and aggregates them using VLAD encoding. The model is trained in a self-supervised fashion without any need for labels. We show that extracting local foreground features is superior to using the ViT's class token in the context of writer retrieval. We evaluate our method on two historical document collections. We set new state-of-the-art performance on the Historical-WI dataset (83.1% mAP) and the HisIR19 dataset (95.0% mAP). Additionally, we demonstrate that our ViT feature extractor can be directly applied to modern datasets such as the CVL database (98.6% mAP) without any fine-tuning.
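The VLAD aggregation step mentioned in the abstract above can be sketched in a few lines: each local descriptor is assigned to its nearest codebook center, the residuals are summed per center, and the concatenated result is L2-normalized. This is the textbook form of VLAD under assumed inputs (a precomputed codebook and plain lists of floats); the authors' feature extraction, codebook training, and any further normalization are not described in the abstract and are not reproduced. Function names are hypothetical.

```python
import math

def nearest_center(descriptor, centers):
    """Index of the codebook center closest to the descriptor (squared Euclidean)."""
    distances = [
        sum((d - c) ** 2 for d, c in zip(descriptor, center)) for center in centers
    ]
    return distances.index(min(distances))

def vlad_encode(descriptors, centers):
    """Sum residuals to the nearest center, concatenate, then L2-normalize."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for descriptor in descriptors:
        k = nearest_center(descriptor, centers)
        for j in range(dim):
            residuals[k][j] += descriptor[j] - centers[k][j]
    flat = [value for row in residuals for value in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat
```

The output has a fixed length (number of centers times descriptor dimension) regardless of how many local descriptors a page yields, so pages can be compared with a simple vector distance.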
Yuxia Wu, Lizi Liao, Yuan Fang
Modeling dynamic graphs, such as those found in social networks,
recommendation systems, and e-commerce platforms, is crucial for capturing
evolving relationships and delivering relevant insights over time. Traditional
approaches primarily rely on graph neural networks with temporal components or
sequence generation models, which often focus narrowly on the historical
context of target nodes. This limitation restricts the ability to adapt to new
and emerging patterns in dynamic graphs. To address this challenge, we propose
a novel framework, Retrieval-Augmented Generation for Dynamic Graph modeling
(RAG4DyG), which enhances dynamic graph predictions by incorporating
contextually and temporally relevant examples from broader graph structures.
Our approach includes a time- and context-aware contrastive learning module to
identify high-quality demonstrations and a graph fusion strategy to effectively
integrate these examples with historical contexts. The proposed framework is
designed to be effective in both transductive and inductive scenarios, ensuring
adaptability to previously unseen nodes and evolving graph structures.
Extensive experiments across multiple real-world datasets demonstrate the
effectiveness of RAG4DyG in improving predictive accuracy and adaptability for
dynamic graph modeling. The code and datasets are publicly available at
https://github.com/YuxiaWu/RAG4DyG.
Authors' comments: Accepted by SIGIR 2025
Richard Zanibbi, Behrooz Mansouri, Anurag Agarwal
Mathematical information is essential for technical work, but its creation,
interpretation, and search are challenging. To help address these challenges,
researchers have developed multimodal search engines and mathematical question
answering systems. This book begins with a simple framework characterizing the
information tasks that people and systems perform as we work to answer
math-related questions. The framework is used to organize and relate the other
core topics of the book, including interactions between people and systems,
representing math formulas in sources, and evaluation. We close by addressing
some key questions and presenting directions for future work. This book is
intended for students, instructors, and researchers interested in systems that
help us find and use mathematical information.
Authors' comments: [DRAFT] Revised (3rd) draft
Kavsar Huseynova, Jafar Isbarov
Document retrieval systems have experienced a revitalized interest with the
advent of retrieval-augmented generation (RAG). RAG architecture offers a lower
hallucination rate than LLM-only applications. However, the accuracy of the
retrieval mechanism is known to be a bottleneck in the efficiency of these
applications. A particular case of subpar retrieval performance is observed in
situations where multiple documents from several different but related topics
are in the corpus. We have devised a new vectorization method that takes into
account the topic information of the document. The paper introduces this new
method for text vectorization and evaluates it in the context of RAG.
Furthermore, we discuss the challenge of evaluating RAG systems, which pertains
to the case at hand.
Authors' comments: Accepted to AICT 2024