Abhishek Divekar, Greg Durrett
Large language models (LLMs) are versatile and can address many tasks, but for computational efficiency, it is often desirable to distill their capabilities into smaller student models. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is "seeded" with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to standard 32-shot prompting and six baseline approaches.
Dominik Farhan
Entity linking (EL) is the computational process of connecting textual
mentions to corresponding entities. Like many areas of natural language
processing, the EL field has greatly benefited from deep learning, leading to
significant performance improvements. However, present-day approaches are
expensive to train and rely on diverse data sources, complicating their
reproducibility. In this thesis, we develop multiple systems that are fast to
train, demonstrating that competitive entity linking can be achieved without a
large GPU cluster. Moreover, we train on a publicly available dataset, ensuring
reproducibility and accessibility. Our models are evaluated on 9 languages,
giving an accurate overview of their strengths. Furthermore, we offer
a detailed analysis of bi-encoder training hyperparameters, a popular approach
in EL, to guide their informed selection. Overall, our work shows that building
competitive neural network based EL systems that operate in multiple languages
is possible even with limited resources, thus making EL more approachable.
Authors' comments: Bachelor's thesis, Charles University
Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, Daniel Campos
Test collections play a vital role in the evaluation of information retrieval
(IR) systems. Obtaining a diverse set of user queries for test collection
construction can be challenging, and acquiring relevance judgments, which
indicate the appropriateness of retrieved documents to a query, is often costly
and resource-intensive. Generating synthetic datasets using Large Language
Models (LLMs) has recently gained significant attention in various
applications. In IR, while previous work exploited the capabilities of LLMs to
generate synthetic queries or documents to augment training data and improve
the performance of ranking models, using LLMs for constructing synthetic test
collections is relatively unexplored. Previous studies demonstrate that LLMs
have the potential to generate synthetic relevance judgments for use in the
evaluation of IR systems. In this paper, we comprehensively investigate whether
it is possible to use LLMs to construct fully synthetic test collections by
generating not only synthetic judgments but also synthetic queries. In
particular, we analyse whether it is possible to construct reliable synthetic
test collections and the potential risks of bias such test collections may
exhibit towards LLM-based models. Our experiments indicate that using LLMs it
is possible to construct synthetic test collections that can reliably be used
for retrieval evaluation.
Authors' comments: SIGIR 2024
Lazaro Janier Gonzalez-Soler, Maciej Salwowski, Christian Rathgeb, Daniel Fischer
Tattoos have been used effectively as soft biometrics to assist law
enforcement in the identification of offenders and victims, as they contain
discriminative information, and are a useful indicator to locate members of a
criminal gang or organisation. Due to various privacy issues in the acquisition
of images containing tattoos, only a limited number of databases exists. This
lack of databases has delayed the development of new methods to effectively
retrieve a potential suspect's tattoo images from a candidate gallery. To
mitigate this issue, in our work, we use an unsupervised generative approach to
create a balanced database consisting of 28,550 semi-synthetic images with
tattooed subjects from 571 tattoo categories. Further, we introduce a novel
Tattoo Template Reconstruction Network (TattTRN), which learns to map the input
tattoo sample to its respective tattoo template to enhance the distinguishing
attributes of the final feature embedding. Experimental results with real data,
i.e., WebTattoo and BIVTatt databases, demonstrate the soundness of the
presented approach: an accuracy of up to 99% is achieved for checking at most
the first 20 entries of the candidate list.
Authors' comments: Accepted at CVPR Workshop 2024
Juhwan Lee, Jisu Kim
This study addresses the hallucination problem in large language models (LLMs). We adopted Retrieval-Augmented Generation (RAG) (Lewis et al., 2020), a technique that involves embedding relevant information in the prompt to obtain accurate answers. However, RAG also faced inherent issues in retrieving correct information. To address this, we employed the Dense Passage Retrieval (DPR) (Karpukhin et al., 2020) model for fetching domain-specific documents related to user queries. Despite this, the DPR model still lacked accuracy in document retrieval. We enhanced the DPR model by incorporating control tokens, achieving significantly superior performance over the standard DPR model, with a 13% improvement in Top-1 accuracy and a 4% improvement in Top-20 accuracy.
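The Top-1 and Top-20 accuracy figures above refer to the usual retrieval metric: the share of queries for which at least one gold passage appears among the first k retrieved results. A minimal sketch of that metric follows; the function name, the document IDs, and the toy data are hypothetical and are not taken from the paper.

```python
def top_k_accuracy(ranked_ids, gold_ids, k):
    """Fraction of queries whose top-k retrieved passages include a gold passage."""
    hits = 0
    for retrieved, gold in zip(ranked_ids, gold_ids):
        # A query counts as a hit if any of its first k results is a gold passage.
        if any(doc_id in gold for doc_id in retrieved[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Hypothetical toy data: one ranked list and one gold set per query.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
gold = [{"d1"}, {"d2"}, {"d0"}]

top1 = top_k_accuracy(ranked, gold, k=1)
top2 = top_k_accuracy(ranked, gold, k=2)
```

In this toy data only the second query has a gold passage at rank 1, while the first query gains a hit once k reaches 2, so the metric can only stay flat or grow as k increases.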
Xin Du, Lixin Xiu, Kumiko Tanaka-Ishii
We apply an information-theoretic perspective to reconsider generative
document retrieval (GDR), in which a document $x \in X$ is indexed by $t \in
T$, and a neural autoregressive model is trained to map queries $Q$ to $T$. GDR
can be considered to involve information transmission from documents $X$ to
queries $Q$, with the requirement to transmit more bits via the indexes $T$. By
applying Shannon's rate-distortion theory, the optimality of indexing can be
analyzed in terms of the mutual information, and the design of the indexes $T$
can then be regarded as a bottleneck in GDR. After reformulating GDR from
this perspective, we empirically quantify the bottleneck underlying GDR.
Finally, using the NQ320K and MARCO datasets, we evaluate our proposed
bottleneck-minimal indexing method in comparison with various previous indexing
methods, and we show that it outperforms those methods.
Authors' comments: Accepted for ICML 2024
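To make the mutual-information view of indexing concrete, the sketch below estimates I(X;T) in bits from sampled (document, index) pairs. It is only an illustration of the quantity mentioned in the abstract, not the paper's bottleneck-minimal indexing method; the function name and the toy assignments are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information I(X;T) in bits from (x, t) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    pt = Counter(t for _, t in pairs)
    # p(x,t) * log2( p(x,t) / (p(x) p(t)) ), written with raw counts.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * pt[t]))
        for (x, t), c in joint.items()
    )


# Hypothetical toy indexings of four equally likely documents.
unique_index = [("d1", "t1"), ("d2", "t2"), ("d3", "t3"), ("d4", "t4")]
shared_index = [("d1", "t1"), ("d2", "t1"), ("d3", "t2"), ("d4", "t2")]

bits_unique = mutual_information(unique_index)  # 2 bits: index identifies the document
bits_shared = mutual_information(shared_index)  # 1 bit: index only identifies a pair
```

A coarser index carries fewer bits about the document, which is the trade-off a rate-distortion analysis of the indexes makes explicit.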
Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu
Retrieval-Augmented Generation (RAG) has recently gained traction in natural language processing. Numerous studies and real-world applications are leveraging its ability to enhance generative models through external information retrieval. Evaluating these RAG systems, however, poses unique challenges due to their hybrid structure and reliance on dynamic knowledge sources. To better understand these challenges, we conduct A Unified Evaluation Process of RAG (Auepora) and aim to provide a comprehensive overview of the evaluation and benchmarks of RAG systems. Specifically, we examine and compare several quantifiable metrics of the Retrieval and Generation components, such as relevance, accuracy, and faithfulness, within the current RAG benchmarks, encompassing the possible output and ground truth pairs. We then analyze the various datasets and metrics, discuss the limitations of current benchmarks, and suggest potential directions to advance the field of RAG benchmarks.
Yong Guan, Dingxiao Liu, Jinchen Ma, Hao Peng, Xiaozhi Wang, Lei Hou, Ru Li
Generative document retrieval, an emerging paradigm in information retrieval,
learns to build connections between documents and identifiers within a single
model, garnering significant attention. However, there are still two
challenges: (1) neglecting inner-content correlation during document
representation; (2) lacking explicit semantic structure during identifier
construction. Nonetheless, events have enriched relations and well-defined
taxonomy, which could facilitate addressing the above two challenges. Inspired
by this, we propose Event GDR, an event-centric generative document retrieval
model, integrating event knowledge into this task. Specifically, we utilize an
exchange-then-reflection method based on multi-agents for event knowledge
extraction. For document representation, we employ events and relations to
model the document to guarantee the comprehensiveness and inner-content
correlation. For identifier construction, we map the events to well-defined
event taxonomy to construct the identifiers with explicit semantic structure.
Our method achieves significant improvement over the baselines on two datasets,
and we hope it provides insights for future research.
Authors' comments: Accepted to WWW 2024
Eugene Yang
High Recall Retrieval (HRR), such as eDiscovery and medical systematic
review, is a search problem that optimizes the cost of retrieving most relevant
documents in a given collection. Iterative approaches, such as iterative
relevance feedback and uncertainty sampling, are shown to be effective under
various operational scenarios. Despite neural models demonstrating success in
other text-related tasks, linear models such as logistic regression, in
general, are still more effective and efficient in HRR since the model is
trained and retrieves documents from the same fixed collection. In this work,
we leverage SPLADE, an efficient retrieval model that transforms documents into
contextualized sparse vectors, for HRR. Our approach combines the best of both
worlds, leveraging both the contextualization from pretrained language models
and the efficiency of linear models. It reduces the review cost by 10% and 18%
in two HRR evaluation collections under a one-phase review workflow with a
target recall of 80%. The experiment is implemented with TARexp and is
available at https://github.com/eugene-yang/LSR-for-TAR.
Authors' comments: 5 pages, 1 figure, accepted at SIGIR 2024 as short paper
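As a rough illustration of how review cost at a target recall can be measured, the sketch below counts how many documents must be reviewed, in the order a system presents them, before 80% of the relevant documents have been found. Treating cost as a plain document count is an assumption made here for simplicity; the function name and the toy orderings are hypothetical and do not come from TARexp.

```python
import math


def review_cost_at_recall(review_order, relevant, target_recall=0.8):
    """Number of documents reviewed, in order, until the target recall is reached."""
    needed = math.ceil(target_recall * len(relevant))
    found = 0
    for cost, doc_id in enumerate(review_order, start=1):
        if doc_id in relevant:
            found += 1
        if found >= needed:
            return cost
    return None  # target recall is not reachable with this ordering


# Hypothetical toy collection: five relevant documents, two review orderings.
relevant = {"a", "b", "c", "d", "e"}
order_baseline = ["a", "x", "b", "c", "y", "d", "e", "z"]
order_improved = ["a", "b", "c", "d", "x", "y", "e", "z"]

cost_baseline = review_cost_at_recall(order_baseline, relevant)
cost_improved = review_cost_at_recall(order_improved, relevant)
reduction = (cost_baseline - cost_improved) / cost_baseline
```

A ranking that surfaces relevant documents earlier reaches the recall target after fewer reviews, which is the kind of saving a percentage cost reduction expresses.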
Eugene Yang, Thomas Jänich, James Mayfield, Dawn Lawrie
Multilingual information retrieval (MLIR) considers the problem of ranking
documents in several languages for a query expressed in a language that may
differ from any of those languages. Recent work has observed that approaches
such as combining ranked lists representing a single document language each or
using multilingual pretrained language models demonstrate a preference for one
language over others. This results in systematic unfair treatment of documents
in different languages. This work proposes a language fairness metric to
evaluate whether documents across different languages are fairly ranked through
statistical equivalence testing using the Kruskal-Wallis test. In contrast to
most prior work in group fairness, we do not consider any language to be an
unprotected group. Thus our proposed measure, PEER (Probability of
Equal Expected Rank), is the first fairness metric specifically designed to
capture the language fairness of MLIR systems. We demonstrate the behavior of
PEER on artificial ranked lists. We also evaluate real MLIR systems on two
publicly available benchmarks and show that the PEER scores align with prior
analytical findings on MLIR fairness. Our implementation is compatible with
ir-measures and is available at http://github.com/hltcoe/peer_measure.
Authors' comments: 5 pages, 1 figure, accepted at SIGIR 2024 as short paper
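The abstract above builds on the Kruskal-Wallis test, which compares groups through the ranks of their pooled values. The sketch below computes only the basic H statistic (without tie correction or a p-value) for rank positions grouped by document language. It is not the PEER measure itself, whose exact definition is not given here; the function name and the toy ranked lists are hypothetical.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for lists of numeric samples."""
    pooled = sorted(value for group in groups for value in group)
    # Assign each distinct value the average of the 1-based ranks it occupies.
    avg_rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        avg_rank[pooled[i]] = (i + 1 + j) / 2
        i = j
    n = len(pooled)
    rank_term = sum(sum(avg_rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)


# Hypothetical rank positions of documents in a result list, grouped by language.
segregated = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # one language always on top
interleaved = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]  # languages mixed throughout

h_segregated = kruskal_wallis_h(segregated)
h_interleaved = kruskal_wallis_h(interleaved)
```

A larger H indicates that the languages occupy systematically different rank positions, so the segregated list scores higher than the interleaved one.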
Hao-Cheng Lo, Jung-Mei Chu, Jieh Hsiang, Chun-Chieh Cho
In patent prosecution, image-based retrieval systems for identifying
similarities between current patent images and prior art are pivotal to ensure
the novelty and non-obviousness of patent applications. Despite their growing
popularity in recent years, existing attempts, while effective at recognizing
images within the same patent, fail to deliver practical value due to their
limited generalizability in retrieving relevant prior art. Moreover, this task
inherently involves the challenges posed by the abstract visual features of
patent images, the skewed distribution of image classifications, and the
semantic information of image descriptions. Therefore, we propose a
language-informed, distribution-aware multimodal approach to patent image
feature learning, which enriches the semantic understanding of patent images by
integrating Large Language Models and improves the performance of
underrepresented classes with our proposed distribution-aware contrastive
losses. Extensive experiments on the DeepPatent2 dataset show that our proposed
method achieves state-of-the-art or comparable performance in image-based
patent retrieval with mAP +53.3%, Recall@10 +41.8%, and MRR@10 +51.9%.
Furthermore, through an in-depth user analysis, we explore our model in aiding
patent professionals in their image retrieval efforts, highlighting the model's
real-world applicability and effectiveness.
Authors' comments: 8 pages. Under review
Jiabao Wang, Yang Wu, Jun Wang, Ni Chen
The multi-plane phase retrieval method provides a budget-friendly and effective way to perform phase imaging, yet it often encounters alignment challenges due to shifts along the optical axis in experiments. Traditional methods, such as employing beamsplitters instead of mechanical stage movements or adjusting focus using tunable light sources, add complexity to the setup required for multi-plane phase retrieval. Attempts to address these issues computationally face difficulties due to the variable impact of diffraction, which renders conventional homography techniques inadequate. In our research, we introduce a novel Adaptive Cascade Calibrated (ACC) strategy for multi-plane phase retrieval that overcomes misalignment issues. This technique detects feature points within the refocused sample space and calculates the transformation matrix for neighboring planes on-the-fly to digitally adjust measurements, facilitating alignment-free multi-plane phase retrieval. This approach not only avoids the need for complex and expensive optical hardware but also simplifies the imaging setup, reducing overall costs. The effectiveness of our method is validated through simulations and real-world optical experiments.
Rong Zou, Marc Pollefeys, Denys Rozumnyi
Moving objects are frequently seen in daily life and usually appear blurred in images due to their motion. While general object retrieval is a widely explored area in computer vision, it primarily focuses on sharp and static objects, and retrieval of motion-blurred objects in large image collections remains unexplored. We propose a method for object retrieval in images that are affected by motion blur. The proposed method learns a robust representation capable of matching blurred objects to their deblurred versions and vice versa. To evaluate our approach, we present the first large-scale datasets for blurred object retrieval, featuring images with objects exhibiting varying degrees of blur in various poses and scales. We conducted extensive experiments, showing that our method outperforms state-of-the-art retrieval methods on the new blur-retrieval datasets, which validates the effectiveness of the proposed approach. Code, data, and model are available at https://github.com/Rong-Zou/Retrieval-Robust-to-Object-Motion-Blur.
Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu
Despite the recent progress in long-context language models, it remains
elusive how transformer-based models exhibit the capability to retrieve
relevant information from arbitrary locations within the long context. This
paper aims to address this question. Our systematic investigation across a wide
spectrum of models reveals that a special type of attention head is largely
responsible for retrieving information, which we dub retrieval heads. We
identify intriguing properties of retrieval heads: (1) universal: all the
explored models with long-context capability have a set of retrieval heads; (2)
sparse: only a small portion (less than 5%) of the attention heads are
retrieval heads. (3) intrinsic: retrieval heads already exist in models pretrained
with short context. When extending the context length by continual pretraining,
it is still the same set of heads that perform information retrieval. (4)
dynamically activated: take Llama-2 7B for example, 12 retrieval heads always
attend to the required information no matter how the context is changed. The
rest of the retrieval heads are activated in different contexts. (5) causal:
completely pruning retrieval heads leads to failure in retrieving relevant
information and results in hallucination, while pruning random non-retrieval
heads does not affect the model's retrieval ability. We further show that
retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the
model needs to frequently refer back to the question and previously-generated
context. Conversely, tasks where the model directly generates the answer using
its intrinsic knowledge are less impacted by masking out retrieval heads. These
observations collectively explain which internal part of the model seeks
information from the input tokens. We believe our insights will foster future
research on reducing hallucination, improving reasoning, and compressing the KV
cache.
Authors' comments: Preprint
Jiaqi Chen, Daniel Barath, Iro Armeni, Marc Pollefeys, Hermann Blum
Natural language interfaces to embodied AI are becoming more ubiquitous in our daily lives. This opens up further opportunities for language-based interaction with embodied agents, such as a user verbally instructing an agent to execute some task in a specific location. For example, "put the bowls back in the cupboard next to the fridge" or "meet me at the intersection under the red sign." As such, we need methods that interface between natural language and map representations of the environment. To this end, we explore the question of whether we can use an open-set natural language query to identify a scene represented by a 3D scene graph. We define this task as "language-based scene-retrieval" and it is closely related to "coarse-localization," but we are instead searching for a match from a collection of disjoint scenes and not necessarily a large-scale continuous map. We present Text2SceneGraphMatcher, a "scene-retrieval" pipeline that learns joint embeddings between text descriptions and scene graphs to determine if they are a match. The code, trained models, and datasets will be made public.
Sefika Efeoglu, Adrian Paschke
Information Extraction (IE) is a transformative process that converts
unstructured text data into a structured format by employing entity and
relation extraction (RE) methodologies. The identification of the relation
between a pair of entities plays a crucial role within this framework. Despite
the existence of various techniques for relation extraction, their efficacy
heavily relies on access to labeled data and substantial computational
resources. In addressing these challenges, Large Language Models (LLMs) emerge
as promising solutions; however, they might return hallucinating responses due
to their own training data. To overcome these limitations, Retrieval-Augmented
Generation-based Relation Extraction (RAG4RE) is proposed in this work,
offering a pathway to enhance the performance of relation extraction tasks.
This work evaluated the effectiveness of our RAG4RE approach utilizing
different LLMs. Through the utilization of established benchmarks, such as
TACRED, TACREV, Re-TACRED, and SemEval RE datasets, our aim is to
comprehensively evaluate the efficacy of our RAG4RE approach. In particular,
we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our
investigation. The results of our study demonstrate that our RAG4RE approach
surpasses the performance of traditional RE approaches based solely on LLMs,
particularly evident in the TACRED dataset and its variations. Furthermore, our
approach exhibits remarkable performance compared to previous RE methodologies
across both TACRED and TACREV datasets, underscoring its efficacy and potential
for advancing RE tasks in natural language processing.
Authors' comments: Submitted to Semantic Web Journal. Under Review
Haya Nachimovsky, Moshe Tennenholtz, Fiana Raiber, Oren Kurland
Previous work on the competitive retrieval setting focused on a single-query setting: document authors manipulate their documents so as to improve their future ranking for a given query. We study a competitive setting where authors opt to improve their document's ranking for multiple queries. We use game theoretic analysis to prove that equilibrium does not necessarily exist. We then empirically show that it is more difficult for authors to improve their documents' rankings for multiple queries with a neural ranker than with a state-of-the-art feature-based ranker. We also present an effective approach for predicting the document most highly ranked in the next induced ranking.
Lenka Tětková, Teresa Karen Scheidt, Maria Mandrup Fogh, Ellen Marie Gaunby Jørgensen, Finn Årup Nielsen, Lars Kai Hansen
Concept-based explainable AI is promising as a tool to improve the
understanding of complex models at the premises of a given user, viz. as a
tool for personalized explainability. An important class of concept-based
explainability methods is constructed with empirically defined concepts,
indirectly defined through a set of positive and negative examples, as in the
TCAV approach (Kim et al., 2018). While it is appealing to the user to avoid
formal definitions of concepts and their operationalization, it can be
challenging to establish relevant concept datasets. Here, we address this
challenge using general knowledge graphs (e.g., Wikidata or WordNet)
for comprehensive concept definition and present a workflow for user-driven
data collection in both text and image domains. The concepts derived from
knowledge graphs are defined interactively, providing an opportunity for
personalization and ensuring that the concepts reflect the user's intentions.
We test the retrieved concept datasets on two concept-based explainability
methods, namely concept activation vectors (CAVs) and concept activation
regions (CARs) (Crabbe and van der Schaar, 2022). We show that CAVs and CARs
based on these empirical concept datasets provide robust and accurate
explanations. Importantly, we also find good alignment between the models'
representations of concepts and the structure of knowledge graphs, i.e., human
representations. This supports our conclusion that knowledge graph-based
concepts are relevant for XAI.
Authors' comments: Preprint. Accepted to The 2nd World Conference on eXplainable
Artificial Intelligence
Chenghao Xiao, G Thomas Hudson, Noura Al Moubayed
Semantic textual similarity (STS) and information retrieval (IR) tasks have been the two major avenues to record the progress of embedding models in the past few years. Under the emerging Retrieval-augmented Generation (RAG) paradigm, we envision the need to evaluate next-level language understanding abilities of embedding models, and take a conscious look at the reasoning abilities stored in them. Addressing this, we pose the question: Can retrievers solve reasoning problems? By transforming reasoning tasks into retrieval tasks, we find that without being specifically trained for reasoning-level language understanding, current state-of-the-art retriever models may still be far from being competent for playing the role of assisting LLMs, especially in reasoning-intensive tasks. Moreover, albeit trained to be aware of instructions, instruction-aware IR models are often better off without instructions at inference time for reasoning tasks, posing an overlooked retriever-LLM behavioral gap for the research community to align. However, recent decoder-based embedding models show great promise in narrowing the gap, highlighting the pathway for embedding models to achieve reasoning-level language understanding. We also show that, although current off-the-shelf re-ranker models fail on these tasks, injecting reasoning abilities into them through fine-tuning still appears easier than doing so to bi-encoders, and we are able to achieve state-of-the-art performance across all tasks by fine-tuning a reranking model. We release Reasoning as Retrieval Benchmark (RAR-b), a holistic suite of tasks and settings to evaluate the reasoning abilities stored in retriever models. RAR-b is available at https://github.com/gowitheflow-1998/RAR-b.
Yanan Zhang, Xiaoling Bai, Tianhua Zhou
The embedding-based retrieval (EBR) approach is widely used in mainstream
search engine retrieval systems and is crucial in recent retrieval-augmented
methods for eliminating LLM hallucinations. However, existing EBR models often face
the "semantic drift" problem and insufficient focus on key information, leading
to a low adoption rate of retrieval results in subsequent steps. This issue is
especially noticeable in real-time search scenarios, where the various
expressions of popular events on the Internet make real-time retrieval heavily
reliant on crucial event information. To tackle this problem, this paper
proposes a novel approach called EER, which enhances real-time retrieval
performance by improving the dual-encoder model of traditional EBR. We
incorporate contrastive learning to accompany pairwise learning for encoder
optimization. Furthermore, to strengthen the focus on critical event
information, we include a decoder module after the document encoder,
introduce a generative event triplet extraction scheme based on prompt-tuning,
and correlate the events with query encoder optimization through comparative
learning. This decoder module can be removed during inference. Extensive
experiments demonstrate that EER can significantly improve the real-time search
retrieval performance. We believe that this approach will provide new
perspectives in the field of information retrieval. The codes and dataset are
available at https://github.com/open-event-hub/Event-enhanced_Retrieval .
Authors' comments: LREC-COLING 2024