Chaeyun Jang, Hyungi Lee, Seanie Lee, Juho Lee
Recently, large language models (LLMs) have been increasingly used to support various decision-making tasks, assisting humans in making informed decisions. However, when LLMs confidently provide incorrect information, it can lead humans to make suboptimal decisions. To prevent LLMs from generating incorrect information on topics they are unsure of and to improve the accuracy of generated content, prior works have proposed Retrieval Augmented Generation (RAG), where external documents are referenced to generate responses. However, traditional RAG methods focus only on retrieving documents most relevant to the input query, without specifically aiming to ensure that the human user's decisions are well-calibrated. To address this limitation, we propose a novel retrieval method called Calibrated Retrieval-Augmented Generation (CalibRAG), which ensures that decisions informed by the retrieved documents are well-calibrated. Then we empirically validate that CalibRAG improves calibration performance as well as accuracy, compared to other baselines across various datasets.
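To make "well-calibrated" concrete, here is a minimal sketch of expected calibration error (ECE), a commonly used calibration metric. The abstract does not say which metric CalibRAG reports, so the choice of ECE, the function name, and the bin count are assumptions made for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece
```

A lower value means stated confidence tracks observed accuracy more closely; a model that is right half the time when it reports 0.5 confidence scores zero.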
Mandeep Rathee, Sean MacAvaney, Avishek Anand
Building relevance models to rank documents based on user information needs
is a central task in the information retrieval and NLP communities. Beyond the
direct ad-hoc search setting, many knowledge-intensive tasks are powered by a
first-stage retrieval step for context selection, followed by a more involved
task-specific model. However, most first-stage ranking stages are inherently
limited by the recall of the initially ranked documents. Recently, adaptive
re-ranking techniques have been proposed to overcome this issue by continually
selecting documents from the whole corpus, rather than only considering an
initial pool of documents. However, so far these approaches have been limited
to heuristic design choices, particularly in terms of the criteria for document
selection. In this work, we present a unifying view of the nascent area of
adaptive retrieval by proposing Quam, a \textit{query-affinity model} that
exploits the relevance-aware document similarity graph to improve recall,
especially for low re-ranking budgets. Our extensive experimental evidence
shows that our proposed approach, Quam, improves the recall performance by up to
26\% over the standard re-ranking baselines. Further, the query affinity
modelling and relevance-aware document graph modules can be injected into any
adaptive retrieval approach. The experimental results show that an existing
adaptive retrieval approach improves its recall by up to 12\% with these modules. The code of our work
is available at \url{https://github.com/Mandeep-Rathee/quam}.
Authors' comments: 15 pages, 10 figures
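The general idea of adaptive re-ranking described above can be sketched as follows. This is a generic illustration, not Quam itself: the alternation rule, the neighbour priority, and all names (`adaptive_rerank`, `neighbours`, `score`, `budget`) are assumptions, and `score` stands in for an expensive re-ranker.

```python
import heapq

def adaptive_rerank(initial, neighbours, score, budget, batch=2):
    """Alternate between the initial pool and graph neighbours of scored documents.

    initial: ranked list of doc ids from the first stage
    neighbours: dict mapping a doc id to similar doc ids (document graph)
    score: callable doc id -> relevance (stand-in for the re-ranker)
    budget: maximum number of documents that may be scored
    """
    scored = {}
    pool = list(initial)
    frontier = []  # heap of (-score of the source doc, neighbour id)
    use_pool = True
    while len(scored) < budget and (pool or frontier):
        picked = []
        if (use_pool and pool) or not frontier:
            while pool and len(picked) < batch:
                doc = pool.pop(0)
                if doc not in scored and doc not in picked:
                    picked.append(doc)
        else:
            while frontier and len(picked) < batch:
                _, doc = heapq.heappop(frontier)
                if doc not in scored and doc not in picked:
                    picked.append(doc)
        use_pool = not use_pool
        for doc in picked[: budget - len(scored)]:
            scored[doc] = score(doc)
            # Neighbours of highly scored documents are explored first.
            for nb in neighbours.get(doc, []):
                if nb not in scored:
                    heapq.heappush(frontier, (-scored[doc], nb))
    return sorted(scored, key=scored.get, reverse=True)
```

Because neighbours are pulled from the whole corpus graph, a relevant document that the first stage missed can still enter the final ranking, which is how such methods get past the recall ceiling of the initial pool.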
Dehai Min, Zhiyang Xu, Guilin Qi, Lifu Huang, Chenyu You
Existing information retrieval (IR) models often assume a homogeneous structure for knowledge sources and user queries, limiting their applicability in real-world settings where retrieval is inherently heterogeneous and diverse. In this paper, we introduce UniHGKR, a unified instruction-aware heterogeneous knowledge retriever that (1) builds a unified retrieval space for heterogeneous knowledge and (2) follows diverse user instructions to retrieve knowledge of specified types. UniHGKR consists of three principal stages: heterogeneous self-supervised pretraining, text-anchored embedding alignment, and instruction-aware retriever fine-tuning, enabling it to generalize across varied retrieval contexts. This framework is highly scalable, with a BERT-based version and a UniHGKR-7B version trained on large language models. Also, we introduce CompMix-IR, the first native heterogeneous knowledge retrieval benchmark. It includes two retrieval scenarios with various instructions, over 9,400 question-answer (QA) pairs, and a corpus of 10 million entries, covering four different types of data. Extensive experiments show that UniHGKR consistently outperforms state-of-the-art methods on CompMix-IR, achieving up to 6.36% and 54.23% relative improvements in two scenarios, respectively. Finally, by equipping our retriever for open-domain heterogeneous QA systems, we achieve a new state-of-the-art result on the popular ConvMix task, with an absolute improvement of up to 4.80 points.
Dong-han Yeom
In this article, we review the information loss paradox in the spirit of the
Euclidean path integral approach. First, we note that the information loss
paradox has long been debated, and we argue that the non-perturbative quantum
gravitational wave function must include the clue to the paradox. The Euclidean
path integral approach provides the best way to describe the wave function.
From this wave function, we can see that there are not only semi-classical
but also non-perturbative contributions, which are highly suppressed but
preserve information. Information retrieval will be sufficiently explained if
such non-perturbative contributions dominate at late times. We will
show that there is sufficient evidence that this scenario can be realized in
generic circumstances. Finally, we compare this scenario with alternative
approaches. Also, we comment on some unresolved issues that need to be
clarified.
Authors' comments: 28 pages, 10 figures; Invited chapter for the edited book "The Black
Hole Information Paradox" (Eds. Ali Akil and Cosimo Bambi, Springer
Singapore, expected in 2025)
Hadeel Saadany, Swapnil Bhosale, Samarth Agrawal, Diptesh Kanojia, Constantin Orasan, Zhe Wu
This paper addresses the challenge of improving user experience on e-commerce
platforms by enhancing product ranking relevant to users' search queries.
Ambiguity and complexity of user queries often lead to a mismatch between the
user's intent and retrieved product titles or documents. Recent approaches have
proposed the use of Transformer-based models, which need millions of annotated
query-title pairs during the pre-training stage, and this data often does not
take user intent into account. To tackle this, we curate samples from existing
datasets at eBay, manually annotated with buyer-centric relevance scores and
centrality scores, which reflect how well the product title matches the users'
intent. We introduce a User-intent Centrality Optimization (UCO) approach for
existing models, which optimises for the user intent in semantic product
search. To that end, we propose a dual-loss based optimisation to handle hard
negatives, i.e., product titles that are semantically relevant but do not
reflect the user's intent. Our contributions include curating challenging
evaluation sets and implementing UCO, resulting in significant product ranking
efficiency improvements observed for different evaluation metrics. Our work
aims to ensure that the most buyer-centric titles for a query are ranked
higher, thereby enhancing the user experience on e-commerce platforms.
Authors' comments: EMNLP 2024: Industry track
Lu Dai, Hao Liu, Hui Xiong
A retrieval module can be plugged into many downstream NLP tasks, such as
open-domain question answering and retrieval-augmented generation, to improve
their performance. The key to a retrieval system is to calculate
relevance scores for query-passage pairs. However, the definition of
relevance is often ambiguous. We observed that a major class of relevance
aligns with the concept of entailment in NLI tasks. Based on this observation,
we designed a method called entailment tuning to improve the embedding of dense
retrievers. Specifically, we unify the form of retrieval data and NLI data
using existence claims as a bridge. Then, we train retrievers to predict the
claims entailed in a passage with a variant task of masked prediction. Our
method can be efficiently plugged into current dense retrieval methods, and
experiments show the effectiveness of our method.
Authors' comments: EMNLP 2024 Main
Paul Youssef, Jörg Schlötterer, Christin Seifert
Pre-trained Language Models (PLMs) encode various facts about the world during their pre-training phase, as they are trained to predict the next or missing word in a sentence. There has been interest in quantifying and improving the number of facts that can be extracted from PLMs, as they have been envisioned to act as soft knowledge bases, which can be queried in natural language. Different approaches exist to enhance fact retrieval from PLMs. Recent work shows that the hidden states of PLMs can be leveraged to determine the truthfulness of the PLMs' inputs. Leveraging this finding to improve factual knowledge retrieval remains unexplored. In this work, we investigate the use of a helper model to improve fact retrieval. The helper model assesses the truthfulness of an input based on the corresponding hidden state representations from the PLMs. We evaluate this approach on several masked PLMs and show that it enhances fact retrieval by up to 33\%. Our findings highlight the potential of hidden state representations from PLMs in improving their factual knowledge retrieval.
Wenjia Zhai
Traditional Retrieval-Augmented Generation (RAG) methods are limited by their reliance on a fixed number of retrieved documents, often resulting in incomplete or noisy information that undermines task performance. Although recent adaptive approaches have alleviated these problems, their application in intricate, real-world multimodal tasks remains limited. To address these limitations, we propose a new approach called Self-adaptive Multimodal Retrieval-Augmented Generation (SAM-RAG), tailored specifically for multimodal contexts. SAM-RAG not only dynamically filters relevant documents based on the input query, including image captions when needed, but also verifies the quality of both the retrieved documents and the output. Extensive experimental results show that SAM-RAG surpasses existing state-of-the-art methods in both retrieval accuracy and response generation. Further ablation experiments and effectiveness analysis show that SAM-RAG maintains high recall quality while improving overall task performance in multimodal RAG tasks. Our code is available at https://github.com/SAM-RAG/SAM_RAG.
Alex Buna, Patrick Rebeschini
Recent progress in robust statistical learning has mainly tackled convex problems, like mean estimation or linear regression, with non-convex challenges receiving less attention. Phase retrieval exemplifies such a non-convex problem, requiring the recovery of a signal from only the magnitudes of its linear measurements, without phase (sign) information. While several non-convex methods, especially those involving the Wirtinger Flow algorithm, have been proposed for noiseless or mild noise settings, developing solutions for heavy-tailed noise and adversarial corruption remains an open challenge. In this paper, we investigate an approach that leverages robust gradient descent techniques to improve the Wirtinger Flow algorithm's ability to simultaneously cope with fourth moment bounded noise and adversarial contamination in both the inputs (covariates) and outputs (responses). We address two scenarios: known zero-mean noise and completely unknown noise. For the latter, we propose a preprocessing step that alters the problem into a new format that does not fit traditional phase retrieval approaches but can still be resolved with a tailored version of the algorithm for the zero-mean noise context.
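For readers unfamiliar with the base algorithm, the following is a minimal sketch of plain Wirtinger Flow in the real-valued, noiseless case: gradient descent on f(x) = (1/4m) * sum(((a_i . x)^2 - y_i)^2). It does not implement the robust gradient estimation or the preprocessing step studied in the paper. The step size, the initial point, and the deterministic toy measurements are assumptions chosen for illustration.

```python
import math

def wirtinger_flow(A, y, x0, step=0.1, iters=1000):
    """Gradient descent on f(x) = (1/4m) * sum(((a_i . x)^2 - y_i)^2) for real signals."""
    m = len(A)
    x = list(x0)
    for _ in range(iters):
        grad = [0.0] * len(x)
        for a, yi in zip(A, y):
            proj = sum(aj * xj for aj, xj in zip(a, x))
            resid = proj * proj - yi
            # Gradient contribution of one measurement: ((a.x)^2 - y) * (a.x) * a / m
            for j, aj in enumerate(a):
                grad[j] += resid * proj * aj / m
        x = [xj - step * gj for xj, gj in zip(x, grad)]
    return x

# Toy, noiseless example with measurement vectors spread over the unit half-circle.
m = 12
A = [[math.cos(math.pi * i / m), math.sin(math.pi * i / m)] for i in range(m)]
x_true = [1.0, -0.5]
y = [sum(a * b for a, b in zip(row, x_true)) ** 2 for row in A]  # magnitudes only
x_hat = wirtinger_flow(A, y, x0=[0.8, -0.3])
```

Since only squared magnitudes are observed, the signal is identifiable only up to a global sign; here the initial point is placed near `x_true` so the iterate settles on that sign.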
Jian Zhu, Mingkai Sheng, Zhangmin Huang, Jingfei Chang, Jinling Jiang, Jian Long, Cheng Luo, Lei Liu
Multi-modal hashing methods, which fuse multi-source data to generate binary
hash codes, are widely used in multimedia retrieval. However, the
individual backbone networks have limited feature expression capabilities and
are not jointly pre-trained on large-scale unsupervised multi-modal data,
resulting in low retrieval accuracy. To address this issue, we propose a novel
CLIP Multi-modal Hashing (CLIPMH) method. Our method employs the CLIP framework
to extract both text and vision features and then fuses them to generate hash
codes. Because each modality's features are enhanced, our method greatly
improves the retrieval performance of multi-modal hashing.
Compared with state-of-the-art unsupervised and supervised multi-modal hashing
methods, experiments reveal that the proposed CLIPMH can significantly improve
performance (a maximum increase of 8.38% in mAP).
Authors' comments: Accepted by 31st International Conference on MultiMedia Modeling
(MMM2025)
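The fuse-then-hash pipeline can be illustrated with a small sketch. It is not the CLIPMH architecture: feature extraction is replaced by hand-written vectors, fusion by simple concatenation, and hashing by a sign threshold, all of which are assumptions made for illustration (as are the function names).

```python
def to_hash_code(text_feat, image_feat):
    """Concatenate two feature vectors and binarise each dimension by its sign."""
    fused = list(text_feat) + list(image_feat)
    return tuple(1 if v > 0 else 0 for v in fused)

def hamming(a, b):
    """Number of positions where two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

def search(query_code, database, k=1):
    """Return the k item ids whose codes are closest to the query in Hamming distance."""
    return sorted(database, key=lambda item: hamming(query_code, database[item]))[:k]
```

Binary codes make retrieval cheap: comparing two items reduces to counting differing bits rather than computing a floating-point similarity.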
Rong Liu, Yongming Qu
The International Council for Harmonisation of Technical Requirements for
Pharmaceuticals for Human Use (ICH) E9 (R1) Addendum provides a framework for
defining estimands in clinical trials. The treatment policy strategy is the most
commonly used approach to handle intercurrent events in defining estimands. Imputing
missing values for potential outcomes under the treatment policy strategy has
been discussed in the literature. Missing values as a result of administrative
study withdrawals (such as site closures due to business reasons, COVID-19
control measures, and geopolitical conflicts) are often imputed in the
same way as other missing values occurring after intercurrent events related to
safety or efficacy. Some research suggests using a hypothetical strategy to
handle the treatment discontinuations due to administrative study withdrawal in
defining the estimands and imputing the missing values based on completer data
assuming missing at random, but this approach ignores the fact that subjects
might experience other intercurrent events had they not had the administrative
study withdrawal. In this article, we consider that administrative study
withdrawal censors the normal real-world-like intercurrent events and propose
two methods for handling the corresponding missing values under the retrieved
dropout imputation framework. Simulation shows that the two methods perform well. We
also applied the methods to actual clinical trial data evaluating an
anti-diabetes treatment.
Authors' comments: 16 pages, 5 tables, and 2 figures
Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, Chao Huang
Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user needs. However, existing RAG systems have significant limitations, including reliance on flat data representations and inadequate contextual awareness, which can lead to fragmented answers that fail to capture complex inter-dependencies. To address these challenges, we propose LightRAG, which incorporates graph structures into text indexing and retrieval processes. This innovative framework employs a dual-level retrieval system that enhances comprehensive information retrieval from both low-level and high-level knowledge discovery. Additionally, the integration of graph structures with vector representations facilitates efficient retrieval of related entities and their relationships, significantly improving response times while maintaining contextual relevance. This capability is further enhanced by an incremental update algorithm that ensures the timely integration of new data, allowing the system to remain effective and responsive in rapidly changing data environments. Extensive experimental validation demonstrates considerable improvements in retrieval accuracy and efficiency compared to existing approaches. We have made our LightRAG open-source and available at the link: https://github.com/HKUDS/LightRAG
Yijie Ding, Yupeng Hou, Jiacheng Li, Julian McAuley
Generative recommendation (GR) is an emerging paradigm that tokenizes items into discrete tokens and learns to autoregressively generate the next tokens as predictions. Although effective, GR models operate in a transductive setting, meaning they can only generate items seen during training without applying heuristic re-ranking strategies. In this paper, we propose SpecGR, a plug-and-play framework that enables GR models to recommend new items in an inductive setting. SpecGR uses a drafter model with inductive capability to propose candidate items, which may include both existing items and new items. The GR model then acts as a verifier, accepting or rejecting candidates while retaining its strong ranking capabilities. We further introduce the guided re-drafting technique to make the proposed candidates more aligned with the outputs of generative recommendation models, improving the verification efficiency. We consider two variants for drafting: (1) using an auxiliary drafter model for better flexibility, or (2) leveraging the GR model's own encoder for parameter-efficient self-drafting. Extensive experiments on three real-world datasets demonstrate that SpecGR exhibits both strong inductive recommendation ability and the best overall performance among the compared methods. Our code is available at: https://github.com/Jamesding000/SpecGR.
Yotam Intrator, Ori Kelner, Regev Cohen, Roman Goldenberg, Ehud Rivlin, Daniel Freedman
Information retrieval (IR) methods, like retrieval augmented generation, are
fundamental to modern applications but often lack statistical guarantees.
Conformal prediction addresses this by retrieving sets guaranteed to include
relevant information, yet existing approaches produce large sets,
incurring high computational costs and slow response times. In this work, we
introduce a score refinement method that applies a simple monotone
transformation to retrieval scores, leading to significantly smaller conformal
sets while maintaining their statistical guarantees. Experiments on various
BEIR benchmarks validate the effectiveness of our approach in producing compact
sets containing relevant information.
Authors' comments: 6 pages
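As background for the conformal sets mentioned above, here is a minimal sketch of standard split conformal calibration applied to retrieval scores. It shows only the baseline set construction, not the paper's score refinement. The function names and the assumption that each calibration query has one known relevant document are illustrative.

```python
import math

def conformal_threshold(relevant_scores, alpha):
    """Pick a score threshold from calibration data.

    relevant_scores: retrieval score of the known relevant document for each
    calibration query. For an exchangeable new query, the relevant document
    scores at or above the returned threshold with probability >= 1 - alpha.
    """
    n = len(relevant_scores)
    k = math.floor((n + 1) * alpha)
    if k < 1:
        # Too few calibration points for this alpha: include every document.
        return float("-inf")
    return sorted(relevant_scores)[k - 1]

def conformal_set(doc_scores, threshold):
    """Return all documents whose score reaches the calibrated threshold."""
    return {doc for doc, s in doc_scores.items() if s >= threshold}
```

The size of the returned set depends on how well the scores separate relevant from irrelevant documents, which is why reshaping the scores can shrink the sets without changing the coverage argument.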
Chao-Wei Huang, Yun-Nung Chen
Effective information retrieval (IR) from vast datasets relies on advanced
techniques to extract relevant information in response to queries. Recent
advancements in dense retrieval have showcased remarkable efficacy compared to
traditional sparse retrieval methods. To further enhance retrieval performance,
knowledge distillation techniques, often leveraging robust cross-encoder
rerankers, have been extensively explored. However, existing approaches
primarily distill knowledge from pointwise rerankers, which assign absolute
relevance scores to documents, thus facing challenges related to inconsistent
comparisons. This paper introduces Pairwise Relevance Distillation
(PairDistill) to leverage pairwise reranking, offering fine-grained
distinctions between similarly relevant documents to enrich the training of
dense retrieval models. Our experiments demonstrate that PairDistill
outperforms existing methods, achieving new state-of-the-art results across
multiple benchmarks. This highlights the potential of PairDistill in advancing
dense retrieval techniques effectively. Our source code and trained models are
released at https://github.com/MiuLab/PairDistill
Authors' comments: Accepted to EMNLP 2024 Main Conference
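The idea of distilling from a pairwise reranker can be sketched with a generic loss. The abstract does not give PairDistill's exact objective, so the cross-entropy form below, the sigmoid of the student's score difference, and all names are assumptions for illustration.

```python
import math

def pairwise_distillation_loss(student_scores, teacher_pref):
    """Cross-entropy between teacher pairwise preferences and student score differences.

    student_scores: dict doc id -> student (dense retriever) relevance score
    teacher_pref: dict (i, j) -> teacher probability that doc i is more relevant than doc j
    """
    total = 0.0
    for (i, j), p in teacher_pref.items():
        # Student probability that i beats j, derived from the score gap.
        q = 1.0 / (1.0 + math.exp(-(student_scores[i] - student_scores[j])))
        q = min(max(q, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))
    return total / len(teacher_pref)
```

Unlike a pointwise target, this signal only constrains the relative order of the two documents, so it can express a preference between documents that would receive nearly identical absolute scores.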
Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Xueqi Cheng
Generative retrieval represents a novel approach to information retrieval. It
uses an encoder-decoder architecture to directly produce relevant document
identifiers (docids) for queries. While this method offers benefits, current
approaches are limited to scenarios with binary relevance data, overlooking the
potential for documents to have multi-graded relevance. Extending generative
retrieval to accommodate multi-graded relevance poses challenges, including the
need to reconcile likelihood probabilities for docid pairs and the possibility
of multiple relevant documents sharing the same identifier. To address these
challenges, we introduce a framework called GRaded Generative Retrieval
(GR$^2$). GR$^2$ focuses on two key components: ensuring relevant and distinct
identifiers, and implementing multi-graded constrained contrastive training.
First, we create identifiers that are both semantically relevant and
sufficiently distinct to represent individual documents effectively. This is
achieved by jointly optimizing the relevance and distinctness of docids through
a combination of docid generation and autoencoder models. Second, we
incorporate information about the relationship between relevance grades to
guide the training process. We use a constrained contrastive training strategy
to bring the representations of queries and the identifiers of their relevant
documents closer together, based on their respective relevance grades.
Extensive experiments on datasets with both multi-graded and binary relevance
demonstrate the effectiveness of GR$^2$.
Authors' comments: Accepted by NeurIPS 2024 (Spotlight)
Hung-Ting Chen, Eunsol Choi
We study retrieving a set of documents that covers various perspectives on a complex and contentious question (e.g., will ChatGPT do more harm than good?). We curate a Benchmark for Retrieval Diversity for Subjective questions (BERDS), where each example consists of a question and diverse perspectives associated with the question, sourced from survey questions and debate websites. On this data, retrievers paired with a corpus are evaluated to surface a document set that contains diverse perspectives. Our framing diverges from most retrieval tasks in that document relevancy cannot be decided by simple string matches to references. Instead, we build a language model-based automatic evaluator that decides whether each retrieved document contains a perspective. This allows us to evaluate the performance of three different types of corpus (Wikipedia, web snapshot, and corpus constructed on the fly with retrieved pages from the search engine) paired with retrievers. Retrieving diverse documents remains challenging, with the outputs from existing retrievers covering all perspectives on only 40% of the examples. We further study the effectiveness of query expansion and diversity-focused reranking approaches and analyze retriever sycophancy.
Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar, Jaydeep Sen
Despite significant progress in multilingual information retrieval, the lack of models capable of effectively supporting multiple languages, particularly low-resource ones such as Indic languages, remains a critical challenge. This paper presents NLLB-E5: A Scalable Multilingual Retrieval Model. NLLB-E5 leverages the in-built multilingual capabilities of the NLLB encoder for translation tasks. It proposes a distillation approach from the multilingual retriever E5 to provide a zero-shot retrieval approach handling multiple languages, including all major Indic languages, without requiring multilingual training data. We evaluate the model on a comprehensive suite of existing benchmarks, including Hindi-BEIR, highlighting its robust performance across diverse languages and tasks. Our findings uncover task and domain-specific challenges, providing valuable insights into the retrieval performance, especially for low-resource languages. NLLB-E5 addresses the urgent need for an inclusive, scalable, and language-agnostic text retrieval model, advancing the field of multilingual information access and promoting digital inclusivity for millions of users globally.
Jüri Keller, Timo Breuer, Philipp Schaer
Information Retrieval (IR) systems are exposed to constant changes in most
components. Documents are created, updated, or deleted, the information needs
are changing, and even relevance might not be static. While it is generally
expected that the IR systems retain a consistent utility for the users, test
collection evaluations rely on a fixed experimental setup. Based on the
LongEval shared task and test collection, this work explores how the
effectiveness measured in evolving experiments can be assessed. Specifically,
the persistency of effectiveness is investigated as a replicability task. It is
observed that the effectiveness progressively deteriorates over time compared to
the initial measurement. Employing adapted replicability measures provides
further insight into the persistence of effectiveness. The ranking of systems
varies across retrieval measures and time. In conclusion, it was found that the
most effective systems are not necessarily the ones with the most persistent
performance.
Authors' comments: Experimental IR Meets Multilinguality, Multimodality, and Interaction
- 15th International Conference of the CLEF Association, CLEF 2024, Grenoble,
France, September 9-12, 2024, Proceedings. arXiv admin note: text overlap
with arXiv:2308.10549
Tuba Gokhan, Kexin Wang, Iryna Gurevych, Ted Briscoe
Regulatory documents, issued by governmental regulatory bodies, establish rules, guidelines, and standards that organizations must adhere to for legal compliance. These documents, characterized by their length, complexity and frequent updates, are challenging to interpret, requiring significant allocation of time and expertise on the part of organizations to ensure ongoing compliance. Regulatory Natural Language Processing (RegNLP) is a multidisciplinary field aimed at simplifying access to and interpretation of regulatory rules and obligations. We introduce a task of generating question-passage pairs, where questions are automatically created and paired with relevant regulatory passages, facilitating the development of regulatory question-answering systems. We create the ObliQA dataset, containing 27,869 questions derived from the collection of Abu Dhabi Global Markets (ADGM) financial regulation documents, design a baseline Regulatory Information Retrieval and Answer Generation (RIRAG) system and evaluate it with RePASs, a novel evaluation metric that tests whether generated answers accurately capture all relevant obligations while avoiding contradictions.