Rong Zou, Marc Pollefeys, Denys Rozumnyi
Moving objects are frequently seen in daily life and usually appear blurred in images due to their motion. While general object retrieval is a widely explored area in computer vision, it primarily focuses on sharp and static objects, and retrieval of motion-blurred objects in large image collections remains unexplored. We propose a method for object retrieval in images that are affected by motion blur. The proposed method learns a robust representation capable of matching blurred objects to their deblurred versions and vice versa. To evaluate our approach, we present the first large-scale datasets for blurred object retrieval, featuring images with objects exhibiting varying degrees of blur in various poses and scales. We conducted extensive experiments, showing that our method outperforms state-of-the-art retrieval methods on the new blur-retrieval datasets, which validates the effectiveness of the proposed approach. Code, data, and model are available at https://github.com/Rong-Zou/Retrieval-Robust-to-Object-Motion-Blur.
Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu
Despite the recent progress in long-context language models, it remains
elusive how transformer-based models exhibit the capability to retrieve
relevant information from arbitrary locations within the long context. This
paper aims to address this question. Our systematic investigation across a wide
spectrum of models reveals that a special type of attention head, which we
dub retrieval heads, is largely responsible for retrieving information. We
identify intriguing properties of retrieval heads: (1) universal: all the
explored models with long-context capability have a set of retrieval heads; (2)
sparse: only a small portion (less than 5\%) of the attention heads are
retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained
with short context. When extending the context length by continual pretraining,
it is still the same set of heads that perform information retrieval. (4)
dynamically activated: take Llama-2 7B for example, 12 retrieval heads always
attend to the required information no matter how the context is changed. The
rest of the retrieval heads are activated in different contexts. (5) causal:
completely pruning retrieval heads leads to failure in retrieving relevant
information and results in hallucination, while pruning random non-retrieval
heads does not affect the model's retrieval ability. We further show that
retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the
model needs to frequently refer back to the question and previously generated
context. Conversely, tasks where the model directly generates the answer using
its intrinsic knowledge are less impacted by masking out retrieval heads. These
observations collectively explain which internal part of the model seeks
information from the input tokens. We believe our insights will foster future
research on reducing hallucination, improving reasoning, and compressing the KV
cache.
Authors' comments: Preprint
Jiaqi Chen, Daniel Barath, Iro Armeni, Marc Pollefeys, Hermann Blum
Natural language interfaces to embodied AI are becoming more ubiquitous in our daily lives. This opens up further opportunities for language-based interaction with embodied agents, such as a user verbally instructing an agent to execute some task in a specific location. For example, "put the bowls back in the cupboard next to the fridge" or "meet me at the intersection under the red sign." As such, we need methods that interface between natural language and map representations of the environment. To this end, we explore the question of whether we can use an open-set natural language query to identify a scene represented by a 3D scene graph. We define this task as "language-based scene-retrieval" and it is closely related to "coarse-localization," but we are instead searching for a match from a collection of disjoint scenes and not necessarily a large-scale continuous map. We present Text2SceneGraphMatcher, a "scene-retrieval" pipeline that learns joint embeddings between text descriptions and scene graphs to determine if they are a match. The code, trained models, and datasets will be made public.
Sefika Efeoglu, Adrian Paschke
Information Extraction (IE) is a transformative process that converts
unstructured text data into a structured format by employing entity and
relation extraction (RE) methodologies. The identification of the relation
between a pair of entities plays a crucial role within this framework. Despite
the existence of various techniques for relation extraction, their efficacy
heavily relies on access to labeled data and substantial computational
resources. In addressing these challenges, Large Language Models (LLMs) emerge
as promising solutions; however, they might return hallucinated responses due
to their own training data. To overcome these limitations, this work proposes
Retrieval-Augmented Generation-based Relation Extraction (RAG4RE),
offering a pathway to enhance the performance of relation extraction tasks.
This work evaluated the effectiveness of our RAG4RE approach utilizing
different LLMs. Through the utilization of established benchmarks, such as
TACRED, TACREV, Re-TACRED, and SemEval RE datasets, our aim is to
comprehensively evaluate the efficacy of our RAG4RE approach. In particular,
we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our
investigation. The results of our study demonstrate that our RAG4RE approach
surpasses the performance of traditional RE approaches based solely on LLMs,
particularly on the TACRED dataset and its variations. Furthermore, our
approach exhibits remarkable performance compared to previous RE methodologies
across both TACRED and TACREV datasets, underscoring its efficacy and potential
for advancing RE tasks in natural language processing.
Authors' comments: Submitted to Semantic Web Journal. Under Review
Haya Nachimovsky, Moshe Tennenholtz, Fiana Raiber, Oren Kurland
Previous work on the competitive retrieval setting focused on a single-query setting: document authors manipulate their documents so as to improve their future ranking for a given query. We study a competitive setting where authors opt to improve their document's ranking for multiple queries. We use game theoretic analysis to prove that equilibrium does not necessarily exist. We then empirically show that it is more difficult for authors to improve their documents' rankings for multiple queries with a neural ranker than with a state-of-the-art feature-based ranker. We also present an effective approach for predicting the document most highly ranked in the next induced ranking.
Lenka Tětková, Teresa Karen Scheidt, Maria Mandrup Fogh, Ellen Marie Gaunby Jørgensen, Finn Årup Nielsen, Lars Kai Hansen
Concept-based explainable AI is promising as a tool to improve the
understanding of complex models at the premises of a given user, viz. as a
tool for personalized explainability. An important class of concept-based
explainability methods is constructed with empirically defined concepts,
indirectly defined through a set of positive and negative examples, as in the
TCAV approach (Kim et al., 2018). While it is appealing to the user to avoid
formal definitions of concepts and their operationalization, it can be
challenging to establish relevant concept datasets. Here, we address this
challenge using general knowledge graphs (e.g., Wikidata or WordNet)
for comprehensive concept definition and present a workflow for user-driven
data collection in both text and image domains. The concepts derived from
knowledge graphs are defined interactively, providing an opportunity for
personalization and ensuring that the concepts reflect the user's intentions.
We test the retrieved concept datasets on two concept-based explainability
methods, namely concept activation vectors (CAVs) and concept activation
regions (CARs) (Crabbe and van der Schaar, 2022). We show that CAVs and CARs
based on these empirical concept datasets provide robust and accurate
explanations. Importantly, we also find good alignment between the models'
representations of concepts and the structure of knowledge graphs, i.e., human
representations. This supports our conclusion that knowledge graph-based
concepts are relevant for XAI.
Authors' comments: Preprint. Accepted to The 2nd World Conference on eXplainable
Artificial Intelligence
Chenghao Xiao, G Thomas Hudson, Noura Al Moubayed
Semantic textual similarity (STS) and information retrieval (IR) tasks have been the two major avenues to record the progress of embedding models in the past few years. Under the emerging Retrieval-augmented Generation (RAG) paradigm, we envision the need to evaluate next-level language understanding abilities of embedding models, and take a conscious look at the reasoning abilities stored in them. Addressing this, we pose the question: Can retrievers solve reasoning problems? By transforming reasoning tasks into retrieval tasks, we find that without being specifically trained for reasoning-level language understanding, current state-of-the-art retriever models may still be far from being competent for playing the role of assisting LLMs, especially in reasoning-intensive tasks. Moreover, albeit trained to be aware of instructions, instruction-aware IR models are often better off without instructions at inference time for reasoning tasks, posing an overlooked retriever-LLM behavioral gap for the research community to align. However, recent decoder-based embedding models show great promise in narrowing the gap, highlighting the pathway for embedding models to achieve reasoning-level language understanding. We also show that, although current off-the-shelf re-ranker models fail on these tasks, injecting reasoning abilities into them through fine-tuning still appears easier than doing so to bi-encoders, and we are able to achieve state-of-the-art performance across all tasks by fine-tuning a reranking model. We release Reasoning as Retrieval Benchmark (RAR-b), a holistic suite of tasks and settings to evaluate the reasoning abilities stored in retriever models. RAR-b is available at https://github.com/gowitheflow-1998/RAR-b.
Yanan Zhang, Xiaoling Bai, Tianhua Zhou
The embedding-based retrieval (EBR) approach is widely used in mainstream
search engine retrieval systems and is crucial in recent retrieval-augmented
methods for eliminating LLM hallucinations. However, existing EBR models often face
the "semantic drift" problem and insufficient focus on key information, leading
to a low adoption rate of retrieval results in subsequent steps. This issue is
especially noticeable in real-time search scenarios, where the various
expressions of popular events on the Internet make real-time retrieval heavily
reliant on crucial event information. To tackle this problem, this paper
proposes a novel approach called EER, which enhances real-time retrieval
performance by improving the dual-encoder model of traditional EBR. We
incorporate contrastive learning to accompany pairwise learning for encoder
optimization. Furthermore, to strengthen the focus on critical event
information in events, we include a decoder module after the document encoder,
introduce a generative event triplet extraction scheme based on prompt-tuning,
and correlate the events with query encoder optimization through comparative
learning. This decoder module can be removed during inference. Extensive
experiments demonstrate that EER can significantly improve the real-time search
retrieval performance. We believe that this approach will provide new
perspectives in the field of information retrieval. The code and dataset are
available at https://github.com/open-event-hub/Event-enhanced_Retrieval .
Authors' comments: LREC-COLING 2024
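The contrastive objective mentioned in the abstract above is commonly implemented for dual encoders as an in-batch, InfoNCE-style loss. The sketch below is a generic illustration of that idea, not EER's actual training code: the function name `in_batch_contrastive_loss`, the temperature value, and the toy vectors are all assumptions made for this example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss where doc_vecs[i] is the positive for query_vecs[i].

    Every other document in the batch acts as a negative for that query.
    The temperature is a hypothetical default, not a value from the paper.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, d) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive document.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy unit vectors: each query is identical to its own document.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = in_batch_contrastive_loss(queries, docs)
# Swapping the documents pairs each query with the wrong positive.
shuffled_loss = in_batch_contrastive_loss(queries, docs[::-1])
print(aligned_loss < shuffled_loss)
```

The loss is small when each query is closest to its own document and large when the pairing is wrong, which is the signal a dual encoder is trained on.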
Jooyeon Kim, Eulrang Cho, Sehyung Kim, Hyunwoo J. Kim
Open-vocabulary object detection (OVD) has been studied with Vision-Language
Models (VLMs) to detect novel objects beyond the pre-trained categories.
Previous approaches improve the generalization ability to expand the knowledge
of the detector, using 'positive' pseudo-labels with additional 'class' names,
e.g., sock, iPod, and alligator. To extend the previous methods in two aspects,
we propose Retrieval-Augmented Losses and visual Features (RALF). Our method
retrieves related 'negative' classes and augments loss functions. Also, visual
features are augmented with 'verbalized concepts' of classes, e.g., worn on the
feet, handheld music player, and sharp teeth. Specifically, RALF consists of
two modules: Retrieval-Augmented Losses (RAL) and Retrieval-Augmented visual
Features (RAF). RAL consists of two losses reflecting the semantic similarity
with negative vocabularies. In addition, RAF augments visual features with the
verbalized concepts from a large language model (LLM). Our experiments
demonstrate the effectiveness of RALF on COCO and LVIS benchmark datasets. We
achieve improvements of up to 3.4 box AP$_{50}^{\text{N}}$ on novel categories of
the COCO dataset and 3.6 mask AP$_{\text{r}}$ gains on the LVIS dataset. Code
is available at https://github.com/mlvlab/RALF .
Authors' comments: Accepted paper at CVPR 2024
Pouria Rouzrokh, Shahriar Faghani, Cooper U. Gamble, Moein Shariatnia, Bradley J. Erickson
Retrieval-augmented generation (RAG) frameworks enable large language models
(LLMs) to retrieve relevant information from a knowledge base and incorporate
it into the context for generating responses. This mitigates hallucinations and
allows for the updating of knowledge without retraining the LLM. However, RAG
does not guarantee valid responses if retrieval fails to identify the necessary
information as the context for response generation. Also, if there is
contradictory content, the RAG response will likely reflect only one of the two
possible responses. Therefore, quantifying uncertainty in the retrieval process
is crucial for ensuring RAG trustworthiness. In this report, we introduce a
four-step framework for applying conformal prediction to quantify retrieval
uncertainty in RAG frameworks. First, a calibration set of questions answerable
from the knowledge base is constructed. Each question's embedding is compared
against document embeddings to identify the most relevant document chunks
containing the answer and record their similarity scores. Given a
user-specified error rate ($\alpha$), these similarity scores are then analyzed
to determine a similarity score cutoff threshold. During inference, all chunks
with similarity exceeding this threshold are retrieved to provide context to
the LLM, ensuring the true answer is captured in the context with a
$(1-\alpha)$ confidence level. We provide a Python package that enables users
to implement the entire workflow proposed in our work, only using LLMs and
without human intervention.
Authors' comments: Github code:
https://github.com/Mayo-Radiology-Informatics-Lab/conflare
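The calibration and inference steps described in the abstract above can be sketched with a standard split-conformal quantile. This is a minimal illustration under stated assumptions, not the API of the authors' package: the function names `conformal_threshold` and `retrieve`, the toy scores, and the choice of `>=` at the cutoff are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Similarity cutoff from calibration scores of answer-bearing chunks.

    Uses the usual split-conformal rule: take the k-th smallest score with
    k = floor(alpha * (n + 1)). Under exchangeability, a new question's
    answer-bearing chunk then scores at or above the cutoff with
    probability at least 1 - alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(calibration_scores)
    k = math.floor(alpha * (n + 1))
    if k < 1:
        # Too few calibration points for this alpha: retrieve everything.
        return float("-inf")
    return sorted(calibration_scores)[k - 1]


def retrieve(chunk_scores, threshold):
    """Return every chunk whose similarity reaches the cutoff."""
    return [chunk for chunk, score in chunk_scores.items() if score >= threshold]


# Hypothetical similarity scores of the true chunk for nine calibration questions.
calibration = [0.91, 0.55, 0.78, 0.62, 0.84, 0.70, 0.95, 0.66, 0.88]
threshold = conformal_threshold(calibration, alpha=0.2)

# Hypothetical similarities between a new query and three chunks.
context = retrieve({"chunk_a": 0.81, "chunk_b": 0.40, "chunk_c": 0.62}, threshold)
print(threshold, context)
```

A smaller `alpha` lowers the cutoff, so more chunks are retrieved in exchange for a higher chance that the answer is in the context.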
Aleksandr V. Petrov, Sean MacAvaney, Craig Macdonald
Transformer-based Cross-Encoders achieve state-of-the-art effectiveness in
text retrieval. However, Cross-Encoders based on large transformer models (such
as BERT or T5) are computationally expensive and allow for scoring only a small
number of documents within a reasonably small latency window. Yet keeping
search latencies low is important for user satisfaction and energy usage. In
this paper, we show that weaker shallow transformer models (i.e., transformers
with a limited number of layers) actually perform better than full-scale models
when constrained to these practical low-latency settings since they can
estimate the relevance of more documents in the same time budget. We further
show that shallow transformers may benefit from the generalized Binary
Cross-Entropy (gBCE) training scheme, which has recently demonstrated success
for recommendation tasks. Our experiments with TREC Deep Learning passage
ranking query sets demonstrate significant improvements in shallow and
full-scale models in low-latency scenarios. For example, when the latency limit
is 25ms per query, MonoBERT-Large (a cross-encoder based on a full-scale BERT
model) is only able to achieve NDCG@10 of 0.431 on TREC DL 2019, while
TinyBERT-gBCE (a cross-encoder based on TinyBERT trained with gBCE) reaches
NDCG@10 of 0.652, a +51% gain over MonoBERT-Large. We also show that shallow
Cross-Encoders are effective even when used without a GPU (e.g., with CPU
inference, NDCG@10 decreases only by 3% compared to GPU inference with 50ms
latency), which makes Cross-Encoders practical to run even without specialized
hardware acceleration.
Authors' comments: Accepted by ECIR2024
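The relative gain quoted in the abstract above can be checked directly from the two NDCG@10 values it reports for the 25ms budget on TREC DL 2019:

```latex
\[
\frac{0.652 - 0.431}{0.431} = \frac{0.221}{0.431} \approx 0.513
\]
```

This corresponds to the stated improvement of roughly +51% of TinyBERT-gBCE over MonoBERT-Large.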
Shengjie Liu, Jing Wu, Jingyuan Bao, Wenyi Wang, Naira Hovakimyan, Christopher G Healey
This paper describes an investigation of the robustness of large language models (LLMs) for retrieval augmented generation (RAG)-based summarization tasks. While LLMs provide summarization capabilities, their performance in complex, real-world scenarios remains under-explored. Our first contribution is LogicSumm, an innovative evaluation framework incorporating realistic scenarios to assess LLM robustness during RAG-based summarization. Based on limitations identified by LogicSumm, we then developed SummRAG, a comprehensive system to create training dialogues and fine-tune a model to enhance robustness within LogicSumm's scenarios. SummRAG is an example of our goal of defining structured methods to test the capabilities of an LLM, rather than addressing issues in a one-off fashion. Experimental results confirm the power of SummRAG, showcasing improved logical coherence and summarization quality. Data, corresponding model weights, and Python code are available online.
Seonho Kim, Kiryung Lee
We consider a least absolute deviation (LAD) approach to the robust phase retrieval problem that aims to recover a signal from its absolute measurements corrupted with sparse noise. To solve the resulting non-convex optimization problem, we propose a robust alternating minimization (Robust-AM) derived as an unconstrained Gauss-Newton method. To solve the inner optimization arising in each step of Robust-AM, we adopt two computationally efficient methods for linear programs. We provide a non-asymptotic convergence analysis of these practical algorithms for Robust-AM under the standard Gaussian measurement assumption. These algorithms, when suitably initialized, are guaranteed to converge linearly to the ground truth at an order-optimal sample complexity with high probability while the support of sparse noise is arbitrarily fixed and the sparsity level is no larger than $1/4$. Additionally, through comprehensive numerical experiments on synthetic and image datasets, we show that Robust-AM outperforms existing methods for robust phase retrieval offering comparable theoretical performance.
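For orientation, the LAD formulation described in the abstract above is usually written as follows. The notation is an assumption made here, since the abstract fixes none: $a_i$ are the measurement vectors, $x_\star$ the unknown signal, $b_i$ the observed magnitudes, and $\eta_i$ the sparse corruption.

```latex
\[
b_i = |\langle a_i, x_\star \rangle| + \eta_i, \qquad i = 1, \dots, m,
\]
\[
\hat{x} \in \arg\min_{x} \; \frac{1}{m} \sum_{i=1}^{m} \bigl|\, b_i - |\langle a_i, x \rangle| \,\bigr| .
\]
```

The absolute-value loss is what gives robustness to the sparse outliers $\eta_i$, and the inner modulus $|\langle a_i, x \rangle|$ is what makes the problem non-convex.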
Aditya Golatkar, Alessandro Achille, Luca Zancato, Yu-Xiang Wang, Ashwin Swaminathan, Stefano Soatto
Retrieval Augmented Generation (RAG) is emerging as a flexible and robust
technique to adapt models to private user data without training, to handle
credit attribution, and to allow efficient machine unlearning at scale.
However, RAG techniques for image generation may lead to parts of the retrieved
samples being copied in the model's output. To reduce risks of leaking private
information contained in the retrieved set, we introduce Copy-Protected
generation with Retrieval (CPR), a new method for RAG with strong copyright
protection guarantees in a mixed-private setting for diffusion models. CPR
allows conditioning the output of diffusion models on a set of retrieved
images, while also guaranteeing that unique identifiable information about
those examples is not exposed in the generated outputs. In particular, it does
so by sampling from a mixture of public (safe) distribution and private (user)
distribution by merging their diffusion scores at inference. We prove that CPR
satisfies Near Access Freeness (NAF) which bounds the amount of information an
attacker may be able to extract from the generated images. We provide two
algorithms for copyright protection, CPR-KL and CPR-Choose. Unlike previously
proposed rejection-sampling-based NAF methods, our methods enable efficient
copyright-protected sampling with a single run of backward diffusion. We show
that our method can be applied to any pre-trained conditional diffusion model,
such as Stable Diffusion or unCLIP. In particular, we empirically show that
applying CPR on top of unCLIP improves quality and text-to-image alignment of
the generated results (81.4 to 83.17 on TIFA benchmark), while enabling credit
attribution, copyright protection, and deterministic, constant-time
unlearning.
Authors' comments: CVPR 2024
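One way to see why combining diffusion scores yields a sample from a blend of two distributions is the identity below. It is a general fact about score functions rather than a statement of CPR's exact rule: the weight $w$ and the names $p_{\text{pub}}$ and $p_{\text{priv}}$ are assumptions introduced here, and the abstract does not say how CPR-KL or CPR-Choose set the combination.

```latex
\[
\nabla_x \log\!\left( p_{\text{pub}}(x)^{1-w}\, p_{\text{priv}}(x)^{w} \right)
= (1-w)\, \nabla_x \log p_{\text{pub}}(x) + w\, \nabla_x \log p_{\text{priv}}(x),
\qquad w \in [0,1].
\]
```

A weighted sum of two scores is therefore the score of a density proportional to the weighted product of the two densities, because the normalizing constant does not depend on $x$ and vanishes under the gradient.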
I. Chalendar, J. R. Partington
Let $f$ and $g$ be analytic functions on the open unit disc $\mathbb D$ such
that $|f|=|g|$ on a set $A$. We first prove that there exists $c$ in the unit
circle $\mathbb T$ such that $f=cg$ when $A$ is the union of two lines in
$\mathbb D$ intersecting at an angle that is an irrational multiple of $\pi$.
The same conclusion is valid when $f$ and $g$ are in the Nevanlinna class and
$A$ is the union of the unit circle and an interior circle, tangential or not.
We also provide sequential versions of the previous results and analyse the
case $A=r\mathbb T$. Finally we examine the situation when there is equality on
two distinct circles in the disc, proving a result or counterexample for each
possible configuration.
Authors' comments: 13 pages, 1 figure
Tingyu Lin, Robert Sablatnig
In analyzing vast amounts of digitally stored historical image data, existing content-based retrieval methods often overlook significant non-semantic information, limiting their effectiveness for flexible exploration across varied themes. To broaden the applicability of image retrieval methods for diverse purposes and uncover more general patterns, we innovatively introduce a crucial factor from computational aesthetics, namely image composition, into this topic. By explicitly integrating composition-related information extracted by CNN into the designed retrieval model, our method considers both the image's composition rules and semantic information. Qualitative and quantitative experiments demonstrate that the image retrieval network guided by composition information outperforms those relying solely on content information, facilitating the identification of images in databases closer to the target image in human perception. Please visit https://github.com/linty5/CCBIR to try our codes.
Ruizhe Zhang, Qingyao Ai, Ziyi Ye, Yueyue Wu, Xiaohui Xie, Yiqun Liu
The tasks of legal case retrieval have received growing attention from the IR
community in the last decade. Relevance feedback techniques with implicit user
feedback (e.g., clicks) have been demonstrated to be effective in traditional
search tasks (e.g., Web search). In legal case retrieval, however, collecting
relevance feedback faces a couple of challenges that are difficult to resolve
under existing feedback paradigms. First, legal case retrieval is a complex
task as users often need to understand the relationship between legal cases in
detail to correctly judge their relevance. Traditional feedback signals such as
clicks are too coarse to use, as they do not reflect any fine-grained relevance
information. Second, legal case documents are usually long, and users often need
tens of minutes to read and understand them. Simple behavior signals such
as clicks and eye-tracking fixations can hardly be useful when users click on
and examine almost every part of the document. In this paper, we explore the
possibility of solving the feedback problem in legal case retrieval with brain
signals. Recent advances in brain signal processing have shown that human
emotions can be collected at a fine granularity through Brain-Machine Interfaces
(BMI) without interrupting the users in their tasks. Therefore, we propose a
framework for legal case retrieval that uses EEG signals to optimize retrieval
results. We collect and create a legal case retrieval dataset with users' EEG
signals and propose several methods to extract effective EEG features for
relevance feedback. Our proposed features achieve a 71% accuracy for feedback
prediction with an SVM-RFE model, and our proposed ranking method that takes
into account the diverse needs of users can significantly improve user
satisfaction for legal case retrieval. Experimental results show that the
re-ranked result lists make users more satisfied.
Authors' comments: 11 pages, 8 figures
Ayush Thakur, Rashmi Vashisth
This paper presents Loops On Retrieval Augmented Generation (LoRAG), a new framework designed to enhance the quality of retrieval-augmented text generation through the incorporation of an iterative loop mechanism. The architecture integrates a generative model, a retrieval mechanism, and a dynamic loop module, allowing for iterative refinement of the generated text through interactions with relevant information retrieved from the input context. Experimental evaluations on benchmark datasets demonstrate that LoRAG surpasses existing state-of-the-art models in terms of BLEU score, ROUGE score, and perplexity, showcasing its effectiveness in achieving both coherence and relevance in generated text. The qualitative assessment further illustrates LoRAG's capability to produce contextually rich and coherent outputs. This research contributes valuable insights into the potential of iterative loops in mitigating challenges in text generation, positioning LoRAG as a promising advancement in the field.
Nazanin Dehghan, Alessio D'Errico, Francesco Di Colandrea, Ebrahim Karimi
The complete measurement of the quantum state of two correlated photons requires reconstructing the amplitude and phase of the biphoton wavefunction. We show how, by means of spatially resolved single photon detection, one can infer the spatial structure of bi-photons generated by spontaneous parametric down conversion. In particular, a spatially resolved analysis of the second-order correlations allows us to isolate the moduli of the pump and phasematching contributions to the two-photon states. When carrying out this analysis on different propagation planes, the free space propagation of pump and phasematching is observed. This result provides, in principle, enough information to also reconstruct the phase of pump and phasematching, and thus the full biphoton wavefunction. We show this in different examples where the pump is shaped as a superposition of orbital angular momentum modes or as a smooth amplitude with a phase structure with no singularities. The corresponding phase structure is retrieved employing maximum likelihood or genetic algorithms. These findings have potential applications in fast, efficient quantum state characterisation that does not require any control over the source.
Huimin Zeng, Zhenrui Yue, Qian Jiang, Dong Wang
Federated Recommendation (FR) emerges as a novel paradigm that enables privacy-preserving recommendations. However, traditional FR systems usually represent users/items with discrete identities (IDs), suffering from performance degradation due to the data sparsity and heterogeneity in FR. On the other hand, Large Language Models (LLMs) as recommenders have proven effective across various recommendation scenarios. Yet, LLM-based recommenders encounter challenges such as low inference efficiency and potential hallucination, compromising their performance in real-world scenarios. To this end, we propose GPT-FedRec, a federated recommendation framework leveraging ChatGPT and a novel hybrid Retrieval Augmented Generation (RAG) mechanism. GPT-FedRec is a two-stage solution. The first stage is a hybrid retrieval process, mining ID-based user patterns and text-based item features. Next, the retrieved results are converted into text prompts and fed into GPT for re-ranking. Our proposed hybrid retrieval mechanism and LLM-based re-ranking aim to extract generalized features from data and exploit pretrained knowledge within the LLM, overcoming data sparsity and heterogeneity in FR. In addition, the RAG approach also prevents LLM hallucination, improving the recommendation performance for real-world users. Experimental results on diverse benchmark datasets demonstrate the superior performance of GPT-FedRec against state-of-the-art baseline methods.