Jooyeon Kim, Eulrang Cho, Sehyung Kim, Hyunwoo J. Kim
Open-vocabulary object detection (OVD) has been studied with Vision-Language
Models (VLMs) to detect novel objects beyond the pre-trained categories.
Previous approaches improve the generalization ability to expand the knowledge
of the detector, using 'positive' pseudo-labels with additional 'class' names,
e.g., sock, iPod, and alligator. To extend the previous methods in two aspects,
we propose Retrieval-Augmented Losses and visual Features (RALF). Our method
retrieves related 'negative' classes and augments loss functions. Also, visual
features are augmented with 'verbalized concepts' of classes, e.g., worn on the
feet, handheld music player, and sharp teeth. Specifically, RALF consists of
two modules: Retrieval-Augmented Losses (RAL) and Retrieval-Augmented visual
Features (RAF). RAL comprises two losses reflecting the semantic similarity
with negative vocabularies. In addition, RAF augments visual features with the
verbalized concepts from a large language model (LLM). Our experiments
demonstrate the effectiveness of RALF on COCO and LVIS benchmark datasets. We
achieve improvement up to 3.4 box AP$_{50}^{\text{N}}$ on novel categories of
the COCO dataset and 3.6 mask AP$_{\text{r}}$ gains on the LVIS dataset. Code
is available at https://github.com/mlvlab/RALF .
Authors' comments: Accepted paper at CVPR 2024
Pouria Rouzrokh, Shahriar Faghani, Cooper U. Gamble, Moein Shariatnia, Bradley J. Erickson
Retrieval-augmented generation (RAG) frameworks enable large language models
(LLMs) to retrieve relevant information from a knowledge base and incorporate
it into the context for generating responses. This mitigates hallucinations and
allows for the updating of knowledge without retraining the LLM. However, RAG
does not guarantee valid responses if retrieval fails to identify the necessary
information as the context for response generation. Also, if there is
contradictory content, the RAG response will likely reflect only one of the two
possible responses. Therefore, quantifying uncertainty in the retrieval process
is crucial for ensuring RAG trustworthiness. In this report, we introduce a
four-step framework for applying conformal prediction to quantify retrieval
uncertainty in RAG frameworks. First, a calibration set of questions answerable
from the knowledge base is constructed. Each question's embedding is compared
against document embeddings to identify the most relevant document chunks
containing the answer and record their similarity scores. Given a
user-specified error rate ($\alpha$), these similarity scores are then analyzed
to determine a similarity score cutoff threshold. During inference, all chunks
with similarity exceeding this threshold are retrieved to provide context to
the LLM, ensuring the true answer is captured in the context with a
$(1-\alpha)$ confidence level. We provide a Python package that enables users
to implement the entire workflow proposed in our work, only using LLMs and
without human intervention.
Authors' comments: Github code:
https://github.com/Mayo-Radiology-Informatics-Lab/conflare
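The thresholding step described in this abstract can be illustrated with a minimal sketch. It assumes the standard split-conformal quantile rule (take the floor(alpha * (n + 1))-th smallest calibration score as the cutoff); the package linked above may compute the cutoff differently. The function names and the toy scores are hypothetical and chosen only for illustration.

```python
import math


def conformal_similarity_cutoff(calibration_scores, alpha):
    """Pick a similarity cutoff from calibration scores (assumed rule).

    calibration_scores: similarity between each calibration question and
    the chunk that actually contains its answer.
    alpha: user-specified error rate in (0, 1).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    n = len(calibration_scores)
    # Rank of the calibration score used as the cutoff (1-indexed).
    k = math.floor(alpha * (n + 1))
    if k == 0:
        # Too few calibration points for this alpha: retrieve everything.
        return float("-inf")
    return sorted(calibration_scores)[k - 1]


def retrieve(chunk_scores, cutoff):
    """Return ids of all chunks whose similarity reaches the cutoff."""
    return [cid for cid, score in chunk_scores.items() if score >= cutoff]


# Hypothetical calibration scores and query-time chunk similarities.
calibration_scores = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.90, 0.65]
cutoff = conformal_similarity_cutoff(calibration_scores, alpha=0.2)
selected = retrieve({"c1": 0.81, "c2": 0.60, "c3": 0.42}, cutoff)
```

With this rule, a smaller alpha lowers the cutoff, so more chunks are passed to the LLM in exchange for a higher chance that the answer-bearing chunk is among them.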
Aleksandr V. Petrov, Sean MacAvaney, Craig Macdonald
Transformer-based Cross-Encoders achieve state-of-the-art effectiveness in
text retrieval. However, Cross-Encoders based on large transformer models (such
as BERT or T5) are computationally expensive and allow for scoring only a small
number of documents within a reasonably small latency window. At the same time, keeping
search latencies low is important for user satisfaction and energy usage. In
this paper, we show that weaker shallow transformer models (i.e., transformers
with a limited number of layers) actually perform better than full-scale models
when constrained to these practical low-latency settings since they can
estimate the relevance of more documents in the same time budget. We further
show that shallow transformers may benefit from the generalized Binary
Cross-Entropy (gBCE) training scheme, which has recently demonstrated success
for recommendation tasks. Our experiments with TREC Deep Learning passage
ranking query sets demonstrate significant improvements in shallow and
full-scale models in low-latency scenarios. For example, when the latency limit
is 25ms per query, MonoBERT-Large (a cross-encoder based on a full-scale BERT
model) is only able to achieve NDCG@10 of 0.431 on TREC DL 2019, while
TinyBERT-gBCE (a cross-encoder based on TinyBERT trained with gBCE) reaches
NDCG@10 of 0.652, a +51% gain over MonoBERT-Large. We also show that shallow
Cross-Encoders are effective even when used without a GPU (e.g., with CPU
inference, NDCG@10 decreases only by 3% compared to GPU inference with 50ms
latency), which makes Cross-Encoders practical to run even without specialized
hardware acceleration.
Authors' comments: Accepted by ECIR2024
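The trade-off in this abstract can be made concrete with a small arithmetic sketch. The per-document scoring latencies below are hypothetical placeholders, not measurements from the paper; only the NDCG@10 values (0.431 and 0.652) and the 25ms budget come from the abstract.

```python
def docs_within_budget(budget_ms, per_doc_ms):
    """Number of documents a model can score inside a latency budget."""
    return int(budget_ms // per_doc_ms)


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Hypothetical per-document latencies (assumptions for illustration only).
deep_model_ms = 5.0
shallow_model_ms = 0.5

deep_docs = docs_within_budget(25, deep_model_ms)
shallow_docs = docs_within_budget(25, shallow_model_ms)

# NDCG@10 values reported in the abstract for TREC DL 2019 at 25ms.
gain = relative_gain(0.431, 0.652)
print(f"deep: {deep_docs} docs, shallow: {shallow_docs} docs, gain: {gain:.1%}")
```

Under these assumed latencies the shallow model re-ranks ten times as many candidates in the same budget, which is the mechanism the abstract credits for its advantage.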
Shengjie Liu, Jing Wu, Jingyuan Bao, Wenyi Wang, Naira Hovakimyan, Christopher G Healey
This paper describes an investigation of the robustness of large language models (LLMs) for retrieval augmented generation (RAG)-based summarization tasks. While LLMs provide summarization capabilities, their performance in complex, real-world scenarios remains under-explored. Our first contribution is LogicSumm, an innovative evaluation framework incorporating realistic scenarios to assess LLM robustness during RAG-based summarization. Based on limitations identified by LogicSumm, we then developed SummRAG, a comprehensive system to create training dialogues and fine-tune a model to enhance robustness within LogicSumm's scenarios. SummRAG is an example of our goal of defining structured methods to test the capabilities of an LLM, rather than addressing issues in a one-off fashion. Experimental results confirm the power of SummRAG, showcasing improved logical coherence and summarization quality. Data, corresponding model weights, and Python code are available online.
Seonho Kim, Kiryung Lee
We consider a least absolute deviation (LAD) approach to the robust phase retrieval problem that aims to recover a signal from its absolute measurements corrupted with sparse noise. To solve the resulting non-convex optimization problem, we propose a robust alternating minimization (Robust-AM) derived as an unconstrained Gauss-Newton method. To solve the inner optimization arising in each step of Robust-AM, we adopt two computationally efficient methods for linear programs. We provide a non-asymptotic convergence analysis of these practical algorithms for Robust-AM under the standard Gaussian measurement assumption. These algorithms, when suitably initialized, are guaranteed to converge linearly to the ground truth at an order-optimal sample complexity with high probability while the support of sparse noise is arbitrarily fixed and the sparsity level is no larger than $1/4$. Additionally, through comprehensive numerical experiments on synthetic and image datasets, we show that Robust-AM outperforms existing methods for robust phase retrieval offering comparable theoretical performance.
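For orientation, one common way to write the LAD objective and an alternating-minimization step is sketched below. The notation ($a_i$ for measurement vectors, $b_i$ for the corrupted magnitudes, $m$ measurements, real-valued signal $x$) is assumed here and does not appear in the abstract; the paper's Robust-AM may differ in detail.

```latex
% LAD objective for absolute measurements b_i \approx |\langle a_i, x \rangle|
\min_{x \in \mathbb{R}^d} \; \frac{1}{m} \sum_{i=1}^{m}
  \bigl|\, b_i - |\langle a_i, x \rangle| \,\bigr|

% One alternating step: fix the signs from the current iterate x^t ...
s_i^{t} = \operatorname{sign}\bigl(\langle a_i, x^{t} \rangle\bigr)

% ... then solve a LAD regression, which can be cast as a linear program
x^{t+1} \in \arg\min_{x,\,u} \; \sum_{i=1}^{m} u_i
  \quad \text{s.t.} \quad
  -u_i \le s_i^{t} b_i - \langle a_i, x \rangle \le u_i, \quad i = 1, \dots, m
```

The linear-program form of the inner step is consistent with the abstract's mention of methods for linear programs.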
Aditya Golatkar, Alessandro Achille, Luca Zancato, Yu-Xiang Wang, Ashwin Swaminathan, Stefano Soatto
Retrieval Augmented Generation (RAG) is emerging as a flexible and robust
technique to adapt models to private user data without training, to handle
credit attribution, and to allow efficient machine unlearning at scale.
However, RAG techniques for image generation may lead to parts of the retrieved
samples being copied in the model's output. To reduce risks of leaking private
information contained in the retrieved set, we introduce Copy-Protected
generation with Retrieval (CPR), a new method for RAG with strong copyright
protection guarantees in a mixed-private setting for diffusion models. CPR
allows the output of diffusion models to be conditioned on a set of retrieved
images, while also guaranteeing that unique identifiable information about
those examples is not exposed in the generated outputs. In particular, it does
so by sampling from a mixture of a public (safe) distribution and a private (user)
distribution by merging their diffusion scores at inference. We prove that CPR
satisfies Near Access Freeness (NAF) which bounds the amount of information an
attacker may be able to extract from the generated images. We provide two
algorithms for copyright protection, CPR-KL and CPR-Choose. Unlike previously
proposed rejection-sampling-based NAF methods, our methods enable efficient
copyright-protected sampling with a single run of backward diffusion. We show
that our method can be applied to any pre-trained conditional diffusion model,
such as Stable Diffusion or unCLIP. In particular, we empirically show that
applying CPR on top of unCLIP improves quality and text-to-image alignment of
the generated results (81.4 to 83.17 on TIFA benchmark), while enabling credit
attribution, copyright protection, and deterministic, constant-time
unlearning.
Authors' comments: CVPR 2024
I. Chalendar, J. R. Partington
Let $f$ and $g$ be analytic functions on the open unit disc $\mathbb D$ such
that $|f|=|g|$ on a set $A$. We first prove that there exists $c$ in the unit
circle $\mathbb T$ such that $f=cg$ when $A$ is the union of two lines in
$\mathbb D$ intersecting at an angle that is an irrational multiple of $\pi$.
The same conclusion is valid when $f$ and $g$ are in the Nevanlinna class and
$A$ is the union of the unit circle and an interior circle, tangential or not.
We also provide sequential versions of the previous results and analyse the
case $A=r\mathbb T$. Finally we examine the situation when there is equality on
two distinct circles in the disc, proving a result or counterexample for each
possible configuration.
Authors' comments: 13 pages, 1 figure
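A standard elementary example, not taken from the paper, shows why equality of moduli on a single circle $A = r\mathbb T$ with $0 < r < 1$ cannot by itself force $f = cg$:

```latex
f(z) = z, \qquad g(z) = r, \qquad 0 < r < 1.
% On the circle |z| = r:
|f(z)| = |z| = r = |g(z)| \quad \text{for all } z \in r\mathbb{T},
% yet f is non-constant and g is constant, so
f \neq c\,g \quad \text{for every } c \in \mathbb{T}.
```

This is the kind of obstruction that makes additional sets, such as a second line or circle, necessary in the results above.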
Tingyu Lin, Robert Sablatnig
In analyzing vast amounts of digitally stored historical image data, existing content-based retrieval methods often overlook significant non-semantic information, limiting their effectiveness for flexible exploration across varied themes. To broaden the applicability of image retrieval methods for diverse purposes and uncover more general patterns, we introduce a crucial factor from computational aesthetics, namely image composition, into this topic. By explicitly integrating composition-related information extracted by a CNN into the designed retrieval model, our method considers both the image's composition rules and semantic information. Qualitative and quantitative experiments demonstrate that the image retrieval network guided by composition information outperforms those relying solely on content information, facilitating the identification of images in databases closer to the target image in human perception. Please visit https://github.com/linty5/CCBIR to try our code.
Ruizhe Zhang, Qingyao Ai, Ziyi Ye, Yueyue Wu, Xiaohui Xie, Yiqun Liu
The tasks of legal case retrieval have received growing attention from the IR
community in the last decade. Relevance feedback techniques with implicit user
feedback (e.g., clicks) have been demonstrated to be effective in traditional
search tasks (e.g., Web search). In legal case retrieval, however, collecting
relevance feedback faces a couple of challenges that are difficult to resolve
under existing feedback paradigms. First, legal case retrieval is a complex
task as users often need to understand the relationship between legal cases in
detail to correctly judge their relevance. Traditional feedback signals such as
clicks are too coarse to use, as they do not reflect any fine-grained relevance
information. Second, legal case documents are usually long, and users often need
tens of minutes to read and understand them. Simple behavior signals such
as clicks and eye-tracking fixations can hardly be useful when users click on
and examine almost every part of the document. In this paper, we explore the
possibility of solving the feedback problem in legal case retrieval with brain
signals. Recent advances in brain signal processing have shown that human
emotional states can be collected at a fine granularity through Brain-Machine Interfaces
(BMI) without interrupting the users in their tasks. Therefore, we propose a
framework for legal case retrieval that uses EEG signals to optimize retrieval
results. We collect and create a legal case retrieval dataset with users' EEG
signals and propose several methods to extract effective EEG features for
relevance feedback. Our proposed features achieve a 71% accuracy for feedback
prediction with an SVM-RFE model, and our proposed ranking method that takes
into account the diverse needs of users can significantly improve user
satisfaction for legal case retrieval. Experimental results show that the
re-ranked result lists make users more satisfied.
Authors' comments: 11pages, 8 figures
Ayush Thakur, Rashmi Vashisth
This paper presents Loops On Retrieval Augmented Generation (LoRAG), a new framework designed to enhance the quality of retrieval-augmented text generation through the incorporation of an iterative loop mechanism. The architecture integrates a generative model, a retrieval mechanism, and a dynamic loop module, allowing for iterative refinement of the generated text through interactions with relevant information retrieved from the input context. Experimental evaluations on benchmark datasets demonstrate that LoRAG surpasses existing state-of-the-art models in terms of BLEU score, ROUGE score, and perplexity, showcasing its effectiveness in achieving both coherence and relevance in generated text. The qualitative assessment further illustrates LoRAG's capability to produce contextually rich and coherent outputs. This research contributes valuable insights into the potential of iterative loops in mitigating challenges in text generation, positioning LoRAG as a promising advancement in the field.
Nazanin Dehghan, Alessio D'Errico, Francesco Di Colandrea, Ebrahim Karimi
The complete measurement of the quantum state of two correlated photons requires reconstructing the amplitude and phase of the biphoton wavefunction. We show how, by means of spatially resolved single photon detection, one can infer the spatial structure of bi-photons generated by spontaneous parametric down conversion. In particular, a spatially resolved analysis of the second-order correlations allows us to isolate the moduli of the pump and phasematching contributions to the two-photon states. When carrying out this analysis on different propagation planes, the free space propagation of pump and phasematching is observed. This result makes it possible, in principle, to gain enough information to also reconstruct the phase of pump and phasematching, and thus the full biphoton wavefunction. We show this in different examples where the pump is shaped as a superposition of orbital angular momentum modes or as a smooth amplitude with a phase structure with no singularities. The corresponding phase structure is retrieved employing maximum likelihood or genetic algorithms. These findings have potential applications in fast, efficient quantum state characterisation that does not require any control over the source.
Huimin Zeng, Zhenrui Yue, Qian Jiang, Dong Wang
Federated Recommendation (FR) emerges as a novel paradigm that enables privacy-preserving recommendations. However, traditional FR systems usually represent users/items with discrete identities (IDs), suffering from performance degradation due to the data sparsity and heterogeneity in FR. On the other hand, Large Language Models (LLMs) as recommenders have proven effective across various recommendation scenarios. Yet, LLM-based recommenders encounter challenges such as low inference efficiency and potential hallucination, compromising their performance in real-world scenarios. To this end, we propose GPT-FedRec, a federated recommendation framework leveraging ChatGPT and a novel hybrid Retrieval Augmented Generation (RAG) mechanism. GPT-FedRec is a two-stage solution. The first stage is a hybrid retrieval process, mining ID-based user patterns and text-based item features. Next, the retrieved results are converted into text prompts and fed into GPT for re-ranking. Our proposed hybrid retrieval mechanism and LLM-based re-ranking aim to extract generalized features from data and exploit pretrained knowledge within the LLM, overcoming data sparsity and heterogeneity in FR. In addition, the RAG approach also prevents LLM hallucination, improving the recommendation performance for real-world users. Experimental results on diverse benchmark datasets demonstrate the superior performance of GPT-FedRec against state-of-the-art baseline methods.
Rose E. Wang, Pawan Wirawarn, Omar Khattab, Noah Goodman, Dorottya Demszky
Many online content portals allow users to ask questions to supplement their
understanding (e.g., of lectures). While information retrieval (IR) systems may
provide answers for such user queries, they do not directly assist content
creators -- such as lecturers who want to improve their content -- identify
segments that _caused_ a user to ask those questions. We introduce the task of
backtracing, in which systems retrieve the text segment that most likely caused
a user query. We formalize three real-world domains for which backtracing is
important in improving content delivery and communication: understanding the
cause of (a) student confusion in the Lecture domain, (b) reader curiosity in
the News Article domain, and (c) user emotion in the Conversation domain. We
evaluate the zero-shot performance of popular information retrieval methods and
language modeling methods, including bi-encoder, re-ranking and
likelihood-based methods and ChatGPT. While traditional IR systems retrieve
semantically relevant information (e.g., details on "projection matrices" for a
query "does projecting multiple times still lead to the same point?"), they
often miss the causally relevant context (e.g., the lecturer states "projecting
twice gets me the same answer as one projection"). Our results show that there
is room for improvement on backtracing and it requires new retrieval
approaches. We hope our benchmark serves to improve future retrieval systems
for backtracing, spawning systems that refine content generation and identify
linguistic triggers influencing user queries. Our code and data are
open-sourced: https://github.com/rosewang2008/backtracing.
Authors' comments: Code: https://github.com/rosewang2008/backtracing; EACL 2024
Findings, Long Paper
Antonio Francesco Mello, Guglielmo Lami, Mario Collura
Quantum computing's promise lies in its intrinsic complexity, with
entanglement initially heralded as its hallmark. However, the quest for quantum
advantage extends beyond entanglement, encompassing the realm of nonstabilizer
(magic) states. Despite their significance, quantifying and characterizing
these states pose formidable challenges. Here, we introduce a novel approach
leveraging Convolutional Neural Networks (CNNs) to classify quantum states
based on their magic content. Without relying on a complete knowledge of the
state, we utilize partial information acquired from measurement snapshots to
train the CNN in distinguishing between stabilizer and nonstabilizer states.
Importantly, our methodology circumvents the limitations of full state
tomography, offering a practical solution for real-world quantum experiments.
In addition, we unveil a theoretical connection between Stabilizer R\'enyi
Entropies (SREs) and the expectation value of Pauli matrices for pure quantum
states. Our findings pave the way for experimental applications, providing a
robust and accessible tool for deciphering the intricate landscape of quantum
resources.
Authors' comments: 7 pages, 4 figures
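As background for the SRE statement in this abstract, the sketch below evaluates the commonly used definition of the stabilizer Rényi entropy of order $n$ for a pure single-qubit state directly from its Pauli expectation values. The formula used, $M_n = (1-n)^{-1}\log\sum_P \Xi_P^n - \log d$ with $\Xi_P = \langle P\rangle^2/d$, is the standard definition from the SRE literature and is an assumption here; the specific connection derived in the paper is not stated in the abstract. The function name and the natural-log convention are likewise illustrative choices.

```python
import math


def stabilizer_renyi_entropy_1q(x, y, z, n=2):
    """SRE of order n for a pure qubit with Bloch vector (x, y, z).

    Uses Pauli expectation values <I>=1, <X>=x, <Y>=y, <Z>=z and the
    assumed definition M_n = log(sum_P Xi_P^n) / (1 - n) - log(d).
    """
    d = 2
    expectations = [1.0, x, y, z]
    # Xi_P = <P>^2 / d is a probability distribution for pure states.
    xi = [e * e / d for e in expectations]
    return math.log(sum(p ** n for p in xi)) / (1 - n) - math.log(d)


# Stabilizer state |0>: Bloch vector (0, 0, 1).
m2_zero = stabilizer_renyi_entropy_1q(0.0, 0.0, 1.0)

# Magic state (|0> + e^{i pi/4} |1>) / sqrt(2): Bloch vector (1, 1, 0) / sqrt(2).
s = 1.0 / math.sqrt(2.0)
m2_t = stabilizer_renyi_entropy_1q(s, s, 0.0)
```

Under this definition the stabilizer state has zero SRE while the magic state has a strictly positive value, which is the distinction the CNN classifier in the abstract is trained to detect from partial measurement data.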
Hui Wu, Min Wang, Wengang Zhou, Zhenbo Lu, Houqiang Li
In asymmetric retrieval systems, models with different capacities are deployed on platforms with different computational and storage resources. Despite the great progress, existing approaches still suffer from a dilemma between retrieval efficiency and asymmetric accuracy due to the limited capacity of the lightweight query model. In this work, we propose an Asymmetric Feature Fusion (AFF) paradigm, which advances existing asymmetric retrieval systems by considering the complementarity among different features just at the gallery side. Specifically, it first embeds each gallery image into various features, e.g., local features and global features. Then, a dynamic mixer is introduced to aggregate these features into compact embedding for efficient search. On the query side, only a single lightweight model is deployed for feature extraction. The query model and dynamic mixer are jointly trained by sharing a momentum-updated classifier. Notably, the proposed paradigm boosts the accuracy of asymmetric retrieval without introducing any extra overhead to the query side. Exhaustive experiments on various landmark retrieval datasets demonstrate the superiority of our paradigm.
Tom Hosking, Hao Tang, Mirella Lapata
We propose a method for unsupervised abstractive opinion summarization that
combines the attributability and scalability of extractive approaches with the
coherence and fluency of Large Language Models (LLMs). Our method, HIRO, learns
an index structure that maps sentences to a path through a semantically
organized discrete hierarchy. At inference time, we populate the index and use
it to identify and retrieve clusters of sentences containing popular opinions
from input reviews. Then, we use a pretrained LLM to generate a readable
summary that is grounded in these extracted evidential clusters. The modularity
of our approach allows us to evaluate its efficacy at each stage. We show that
HIRO learns an encoding space that is more semantically structured than prior
work, and generates summaries that are more representative of the opinions in
the input reviews. Human evaluation confirms that HIRO generates significantly
more coherent, detailed and accurate summaries.
Authors' comments: Accepted to TACL; Pre MIT Press version
Pierre Erbacher, Jian-Yun Nie, Philippe Preux, Laure Soulier
Conversational systems have made significant progress in generating natural language responses. However, their potential as conversational search systems is currently limited due to their passive role in the information-seeking process. One major limitation is the scarcity of datasets that provide labelled ambiguous questions along with a supporting corpus of documents and relevant clarifying questions. This work aims to tackle the challenge of generating relevant clarifying questions by taking into account the inherent ambiguities present in both user queries and documents. To achieve this, we propose PAQA, an extension to the existing AmbiNQ dataset, incorporating clarifying questions. We then evaluate various models and assess how passage retrieval impacts ambiguity detection and the generation of clarifying questions. By addressing this gap in conversational search systems, we aim to provide additional supervision to enhance their active participation in the information-seeking process and provide users with more accurate results.
Jianyou Wang, Kaicheng Wang, Xiaoyue Wang, Weili Cao, Ramamohan Paturi, Leon Bergen
Effective information retrieval (IR) in settings with limited training data,
particularly for complex queries, remains a challenging task. This paper
introduces IR2, Information Regularization for Information Retrieval, a
technique for reducing overfitting during synthetic data generation. This
approach, representing a novel application of regularization techniques in
synthetic data creation for IR, is tested on three recent IR tasks
characterized by complex queries: DORIS-MAE, ArguAna, and WhatsThatBook.
Experimental results indicate that our regularization techniques not only
outperform previous synthetic query generation methods on the tasks considered
but also reduce cost by up to 50%. Furthermore, this paper categorizes and
explores three regularization methods at different stages of the query
synthesis pipeline (input, prompt, and output), each offering varying degrees of
performance improvement compared to models where no regularization is applied.
This provides a systematic approach for optimizing synthetic data generation in
data-limited, complex-query IR scenarios. All code, prompts and synthetic data
are available at
https://github.com/Info-Regularization/Information-Regularization.
Authors' comments: Accepted by LREC-COLING 2024 - The 2024 Joint International
Conference on Computational Linguistics, Language Resources and Evaluation
Seraphina Goldfarb-Tarrant, Pedro Rodriguez, Jane Dwivedi-Yu, Patrick Lewis
Dense retrievers compress source documents into (possibly lossy) vector representations, yet there is little analysis of what information is lost versus preserved, and how it affects downstream tasks. We conduct the first analysis of the information captured by dense retrievers compared to the language models they are based on (e.g., BERT versus Contriever). We use 25 MultiBert checkpoints as randomized initialisations to train MultiContrievers, a set of 25 contriever models. We test whether specific pieces of information -- such as gender and occupation -- can be extracted from contriever vectors of Wikipedia-like documents. We measure this extractability via information theoretic probing. We then examine the relationship of extractability to performance and gender bias, as well as the sensitivity of these results to many random initialisations and data shuffles. We find that (1) contriever models have significantly increased extractability, but extractability usually correlates poorly with benchmark performance; (2) gender bias is present, but is not caused by the contriever representations; and (3) there is high sensitivity to both random initialisation and to data shuffle, suggesting that future retrieval research should test across a wider spread of both.
Danyang Hou, Liang Pang, Huawei Shen, Xueqi Cheng
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task
focused on identifying a specific moment within a vast corpus of untrimmed
videos using a natural language query. Existing methods for VCMR typically
rely on frame-aware video retrieval, calculating similarities between the query
and video frames to rank videos based on maximum frame similarity. However, this
approach overlooks the semantic structure embedded within the information
between frames, namely, the event, a crucial element for human comprehension of
videos. Motivated by this, we propose EventFormer, a model that explicitly
utilizes events within videos as fundamental units for video retrieval. The
model extracts event representations through event reasoning and hierarchical
event encoding. The event reasoning module groups consecutive and visually
similar frame representations into events, while the hierarchical event
encoding encodes information at both the frame and event levels. We also
introduce anchor multi-head self-attention to encourage the Transformer to capture
the relevance of adjacent content in the video. The training of EventFormer is
conducted by two-branch contrastive learning and dual optimization for two
sub-tasks of VCMR. Extensive experiments on TVR, ANetCaps, and DiDeMo
benchmarks show the effectiveness and efficiency of EventFormer in VCMR,
achieving new state-of-the-art results. Additionally, the effectiveness of
EventFormer is also validated on the partially relevant video retrieval task.
Authors' comments: 11 pages, 5 figures, 9 tables
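The idea of grouping consecutive, visually similar frames into events can be illustrated with a minimal sketch. This is not the authors' event reasoning module: the cosine-similarity rule, the 0.9 threshold, the function names, and the toy two-dimensional frame vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_frames_into_events(frames, threshold=0.9):
    """Group consecutive frame indices whose neighbours are similar.

    A frame joins the current event when its cosine similarity to the
    previous frame reaches the threshold; otherwise it starts a new event.
    """
    events = []
    for index, frame in enumerate(frames):
        if events and cosine(frames[index - 1], frame) >= threshold:
            events[-1].append(index)
        else:
            events.append([index])
    return events


# Hypothetical frame representations (toy 2-D vectors).
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
events = group_frames_into_events(frames)
```

Each resulting list of frame indices would then serve as one retrieval unit, in place of scoring every frame against the query independently.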