Yixiao Yang, Ran Tao, Kaixuan Wei, Jun Shi
Classical phase retrieval is the task of recovering a signal from its Fourier magnitude measurements, which carry inherent ambiguities. A single-exposure intensity measurement is commonly deemed insufficient to reconstruct the original signal, because the missing phase component is required for the inverse transformation. In this work, we present a novel single-shot phase retrieval paradigm from a fractional Fourier transform (FrFT) perspective, which integrates the FrFT-based physical measurement model within a self-supervised reconstruction scheme. Specifically, the proposed FrFT-based measurement model addresses the problem of aliasing artifacts in the numerical calculation of Fresnel diffraction and adapts to both short-distance and long-distance propagation scenarios. Moreover, the intensity measurement in the FrFT domain proves highly effective in alleviating the ambiguities of phase retrieval and in relaxing the previous requirements for oversampled or multiple measurements in the Fourier domain. Furthermore, the proposed self-supervised reconstruction approach combines the fast discrete FrFT algorithm with untrained neural network priors, thereby attaining superior results. Through numerical simulations, we demonstrate that both amplitude and phase objects can be effectively retrieved from a single-shot intensity measurement using the proposed approach, which provides a promising technique for support-free coherent diffraction imaging.
Yan Di, Chenyangguang Zhang, Chaowei Wang, Ruida Zhang, Guangyao Zhai, Yanyan Li, Bowen Fu, Xiangyang Ji et al.
In this paper, we present ShapeMatcher, a unified self-supervised learning
framework for joint shape canonicalization, segmentation, retrieval and
deformation. Given a partially-observed object in an arbitrary pose, we first
canonicalize the object by extracting point-wise affine-invariant features,
disentangling the inherent structure of the object from its pose and size. These
learned features are then leveraged to predict semantically consistent part
segmentation and corresponding part centers. Next, our lightweight retrieval
module aggregates the features within each part as its retrieval token and
compares all the tokens with source shapes from a pre-established database to
identify the most geometrically similar shape. Finally, we deform the retrieved
shape in the deformation module to tightly fit the input object by harnessing
part center guided neural cage deformation. The key insight of ShapeMatcher is
the simultaneous training of the four highly-associated processes:
canonicalization, segmentation, retrieval, and deformation, leveraging
cross-task consistency losses for mutual supervision. Extensive experiments on
synthetic datasets PartNet, ComplementMe, and real-world dataset Scan2CAD
demonstrate that ShapeMatcher surpasses competitors by a large margin.
Authors' comments: CVPR2024
Nicolas Jonason, Luca Casini, Carl Thomé, Bob L. T. Sturm
We explore the use of large language models (LLMs) for music generation using
a retrieval system to select relevant examples. We find promising initial
results for music generation in a dialogue with the user, especially
considering the ease with which such a system can be implemented. The code is
available online.
Authors' comments: LBD @ ISMIR 2023
Aman Sinha, Priyanshu Raj Mall, Dwaipayan Roy
The accessibility of documents within a collection holds a pivotal role in
Information Retrieval, signifying the ease of locating specific content in a
collection of documents. This accessibility can be achieved via two distinct
avenues. The first is through a retrieval model using keyword or other
feature-based search, and the second is by navigating to documents through the
links associated with them, if available. Metrics such as PageRank, Hub, and
Authority illuminate the pathways through which documents can be discovered
within the network of content while the concept of Retrievability is used to
quantify the ease with which a document can be found by a retrieval model. In
this paper, we compare these two perspectives, PageRank and retrievability, as
they quantify the importance and discoverability of content in a corpus.
Through empirical experimentation on benchmark datasets, we demonstrate a
subtle similarity between retrievability and PageRank, which is particularly
distinguishable for larger datasets.
Authors' comments: Accepted at FIRE 2023
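The two perspectives compared in this abstract can be made concrete with a minimal sketch. It assumes the cumulative form of retrievability (a document scores one point for every query that ranks it within a cutoff) and a plain power-iteration PageRank; the function names, the cutoff, and the toy inputs are illustrative assumptions, not the authors' experimental setup.

```python
def cumulative_retrievability(rankings, cutoff):
    """Count, per document, the queries that rank it within the top `cutoff`.

    `rankings` maps a query to its ranked list of document ids (assumed input).
    """
    scores = {}
    for ranked_docs in rankings.values():
        for doc in ranked_docs[:cutoff]:
            scores[doc] = scores.get(doc, 0) + 1
    return scores


def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict mapping node -> list of out-links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Toy data (hypothetical): two queries and a three-page link graph.
toy_rankings = {"q1": ["a", "b", "c"], "q2": ["b", "a", "c"]}
toy_links = {"a": ["b"], "b": ["a"], "c": ["a"]}
retrievability = cumulative_retrievability(toy_rankings, cutoff=2)
page_rank = pagerank(toy_links)
```

Comparing the two score dictionaries over the same document ids, for example with a rank correlation, is one way such a study could relate the measures.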
Jon Saad-Falcon, Omar Khattab, Christopher Potts, Matei Zaharia
Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. Using synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across six different knowledge-intensive tasks in KILT and SuperGLUE, ARES accurately evaluates RAG systems while using a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our datasets and code for replication and deployment available at https://github.com/stanford-futuredata/ARES.
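The role of the human-annotated datapoints can be illustrated with the standard prediction-powered inference point estimate for a mean: the judge's average prediction on a large unlabeled set is corrected by the judge's average error on the small labeled set. This is a minimal sketch under that assumption; the function name and toy values are hypothetical, the confidence-interval part of PPI is omitted, and it is not the ARES implementation.

```python
from statistics import mean


def ppi_mean(judge_unlabeled, judge_labeled, human_labeled):
    """PPI point estimate: judge mean on unlabeled data plus a rectifier.

    The rectifier is the mean (human - judge) gap on the annotated subset.
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("labeled judge scores and human labels must align")
    rectifier = mean([h - j for h, j in zip(human_labeled, judge_labeled)])
    return mean(judge_unlabeled) + rectifier


# Hypothetical binary relevance judgments (1 = relevant).
estimate = ppi_mean(
    judge_unlabeled=[1, 1, 0, 1],
    judge_labeled=[1, 1],
    human_labeled=[1, 0],
)
```

In this toy case the judge is over-optimistic on the annotated pair, so the rectifier pulls the estimate below the judge's raw average.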
Ting-Rui Chiang, Xinyan Velocity Yu, Joshua Robinson, Ollie Liu, Isabelle Lee, Dani Yogatama
Augmenting a language model (LM) with $k$-nearest neighbors ($k$NN) retrieval
on its training data alone can decrease its perplexity, though the underlying
reasons for this remain elusive. In this work, we rule out one previously
posited possibility -- the "softmax bottleneck." We then create a new dataset
to evaluate LM generalization ability in the setting where training data
contains additional information that is not causally relevant. This task is
challenging even for GPT-3.5 Turbo. We show that, for both GPT-2 and Mistral
7B, $k$NN retrieval augmentation consistently improves performance in this
setting. Finally, to make $k$NN retrieval more accessible, we propose using a
multi-layer perceptron model that maps datastore keys to values as a drop-in
replacement for traditional retrieval. This reduces storage costs by over 25x.
Authors' comments: Accepted to NAACL 2024
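The abstract does not spell out the retrieval augmentation itself. A minimal sketch, assuming the commonly used kNN-LM formulation in which a distribution built from retrieved neighbors is linearly interpolated with the LM's next-token distribution, looks as follows; the function names, the temperature, and the interpolation weight are illustrative assumptions.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    """Turn (distance, next_token) neighbors into a next-token distribution.

    Each neighbor contributes weight exp(-distance / temperature).
    """
    weights = defaultdict(float)
    for distance, token in neighbors:
        weights[token] += math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_lm.get(t, 0.0) for t in vocab}


# Toy example (hypothetical tokens and distances).
p_knn = knn_distribution([(0.0, "a"), (0.0, "b")])
p_mix = interpolate({"a": 1.0}, p_knn, lam=0.5)
```

The proposed multi-layer perceptron replacement would stand in for the neighbor lookup that produces `neighbors` here; that model is not sketched.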
Yunah Jang, Kang-il Lee, Hyunkyung Bae, Hwanhee Lee, Kyomin Jung
Conversational search aims to retrieve passages containing essential information to answer queries in a multi-turn conversation. In conversational search, reformulating context-dependent conversational queries into stand-alone forms is imperative to effectively utilize off-the-shelf retrievers. Previous methodologies for conversational query reformulation frequently depend on human-annotated rewrites. However, these manually crafted queries often result in sub-optimal retrieval performance and require high collection costs. To address these challenges, we propose Iterative Conversational Query Reformulation (IterCQR), a methodology that conducts query reformulation without relying on human rewrites. IterCQR iteratively trains the conversational query reformulation (CQR) model by directly leveraging information retrieval (IR) signals as a reward. Our IterCQR training guides the CQR model such that generated queries contain necessary information from the previous dialogue context. Our proposed method shows state-of-the-art performance on two widely-used datasets, demonstrating its effectiveness on both sparse and dense retrievers. Moreover, IterCQR exhibits superior performance in challenging settings such as generalization on unseen datasets and low-resource scenarios.
Pritom Saha Akash, Kashob Kumar Roy, Lucian Popa, Kevin Chen-Chuan Chang
Long-form question answering (LFQA) poses a challenge as it involves generating detailed answers in the form of paragraphs, which go beyond simple yes/no responses or short factual answers. While existing QA models excel in questions with concise answers, LFQA requires handling multiple topics and their intricate relationships, demanding comprehensive explanations. Previous attempts at LFQA focused on generating long-form answers by utilizing relevant contexts from a corpus, relying solely on the question itself. However, they overlooked the possibility that the question alone might not provide sufficient information to identify the relevant contexts. Additionally, generating detailed long-form answers often entails aggregating knowledge from diverse sources. To address these limitations, we propose an LFQA model with iterative Planning, Retrieval, and Generation. This iterative process continues until a complete answer is generated for the given question. From an extensive experiment on both an open domain and a technical domain QA dataset, we find that our model outperforms the state-of-the-art models on various textual and factual metrics for the LFQA task.
Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, Dong Yu
Retrieval-augmented language models (RALMs) represent a substantial
advancement in the capabilities of large language models, notably in reducing
factual hallucination by leveraging external knowledge sources. However, the
reliability of the retrieved information is not always guaranteed. The
retrieval of irrelevant data can lead to misguided responses and potentially
cause the model to overlook its inherent knowledge, even when it possesses
adequate information to address the query. Moreover, standard RALMs often
struggle to assess whether they possess adequate knowledge, both intrinsic and
retrieved, to provide an accurate answer. In situations where knowledge is
lacking, these systems should ideally respond with "unknown" when the answer is
unattainable. In response to these challenges, we introduce Chain-of-Noting
(CoN), a novel approach aimed at improving the robustness of RALMs in facing
noisy, irrelevant documents and in handling unknown scenarios. The core idea of
CoN is to generate sequential reading notes for retrieved documents, enabling a
thorough evaluation of their relevance to the given question and integrating
this information to formulate the final answer. We employed ChatGPT to create
training data for CoN, which was subsequently used to train an LLaMa-2 7B model.
Our experiments across four open-domain QA benchmarks show that RALMs equipped
with CoN significantly outperform standard RALMs. Notably, CoN achieves an
average improvement of +7.9 in EM score given entirely noisy retrieved
documents and +10.5 in rejection rates for real-time questions that fall
outside the pre-training knowledge scope.
Authors' comments: EMNLP 2024 (main conference)
Tamir Bendory, Nadav Dym, Dan Edidin, Arun Suresh
The key ingredient to retrieving a signal from its Fourier magnitudes, namely, to solve the phase retrieval problem, is an effective prior on the sought signal. In this paper, we study the phase retrieval problem under the prior that the signal lies in a semi-algebraic set. This is a very general prior as semi-algebraic sets include linear models, sparse models, and ReLU neural network generative models. The latter is the main motivation of this paper, due to the remarkable success of deep generative models in a variety of imaging tasks, including phase retrieval. We prove that almost all signals in R^N can be determined from their Fourier magnitudes, up to a sign, if they lie in a (generic) semi-algebraic set of dimension N/2. The same is true for all signals if the semi-algebraic set is of dimension N/4. We also generalize these results to the problem of signal recovery from the second moment in multi-reference alignment models with multiplicity free representations of compact groups. This general result is then used to derive improved sample complexity bounds for recovering band-limited functions on the sphere from their noisy copies, each acted upon by a random element of SO(3).
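The sign ambiguity mentioned in this abstract is easy to check numerically: a signal and its negation have identical Fourier magnitudes, so magnitudes alone cannot distinguish them. The sketch below uses a naive discrete Fourier transform written for illustration; the signal values are arbitrary.

```python
import cmath


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform (naive O(N^2) version)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


x = [1.0, -2.0, 0.5, 3.0]
neg_x = [-v for v in x]

# Fourier magnitudes cannot tell x from -x.
same_magnitudes = all(
    abs(a - b) < 1e-9 for a, b in zip(dft_magnitudes(x), dft_magnitudes(neg_x))
)
```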
Yuichi Sasazawa, Kenichi Yokote, Osamu Imaichi, Yasuhiro Sogawa
Text retrieval is the task of retrieving documents similar to a search query, and it is important to improve retrieval accuracy while maintaining a certain level of retrieval speed. Existing studies have reported accuracy improvements using language models, but many of them do not take into account the reduction in search speed that comes with increased performance. In this study, we propose a three-stage re-ranking model that uses model ensembles or larger language models to improve search accuracy while minimizing the search delay. We rank the documents by BM25 and language models, and then re-rank the documents with high similarity to the query using a model ensemble or a larger language model. In our experiments, we train the MiniLM language model on the MS-MARCO dataset and evaluate it in a zero-shot setting. Our proposed method achieves higher retrieval accuracy while limiting the loss in retrieval speed.
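The three stages described in this abstract can be sketched as a cascade in which each stage scores fewer documents with a more expensive scorer. The scoring functions are passed in as hypothetical callables standing in for BM25, a small language model, and an ensemble or larger model; the cutoffs `k1` and `k2` are assumed parameters, not values from the paper.

```python
def cascade_rerank(query, docs, cheap_score, small_score, large_score, k1=100, k2=10):
    """Three-stage re-ranking sketch.

    Stage 1: keep the top-k1 documents under the cheap scorer (e.g. BM25).
    Stage 2: re-order those with a small model.
    Stage 3: re-order only the top-k2 of stage 2 with the expensive scorer.
    """
    stage1 = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:k1]
    stage2 = sorted(stage1, key=lambda d: small_score(query, d), reverse=True)
    head = sorted(stage2[:k2], key=lambda d: large_score(query, d), reverse=True)
    return head + stage2[k2:]


# Toy scorers (hypothetical stand-ins for real models).
def overlap(query, doc):
    return sum(word in doc.split() for word in query.split())


def length(query, doc):
    return len(doc.split())


def brevity(query, doc):
    return -len(doc)


ranked = cascade_rerank(
    "a b", ["a b", "a", "c", "a b c"], overlap, length, brevity, k1=3, k2=2
)
```

Only the `k2` documents in the head ever reach the expensive scorer, which is how a cascade of this kind bounds the added latency.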
Konstantin Yakovlev, Gregory Polyakov, Ilseyar Alimova, Alexander Podolskiy, Andrey Bout, Sergey Nikolenko, Irina Piontkovskaya
A recent trend in multimodal retrieval is related to postprocessing test set
results via the dual-softmax loss (DSL). While this approach can bring
significant improvements, it usually presumes that an entire matrix of test
samples is available as DSL input. This work introduces a new postprocessing
approach based on Sinkhorn transformations that outperforms DSL. Further, we
propose a new postprocessing setting that does not require access to multiple
test queries. We show that our approach can significantly improve the results
of state of the art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus
achieving a new state-of-the-art on several standard text-video retrieval
datasets both with access to the entire test set and in the single-query
setting.
Authors' comments: SIGIR 2023
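As a rough illustration of Sinkhorn-style postprocessing, the sketch below exponentiates a query-by-item similarity matrix and alternately normalizes its rows and columns. This is the textbook Sinkhorn normalization, shown under the assumption that it conveys the general idea; the temperature, iteration count, and toy matrix are illustrative, and the paper's exact transformation and its single-query setting are not reproduced.

```python
import math


def sinkhorn(similarity, temperature=0.1, iterations=50):
    """Alternate row and column normalization of exp(similarity / temperature)."""
    k = [[math.exp(s / temperature) for s in row] for row in similarity]
    for _ in range(iterations):
        # Normalize each row (query) to sum to one.
        k = [[v / sum(row) for v in row] for row in k]
        # Normalize each column (item) to sum to one.
        col_sums = [sum(col) for col in zip(*k)]
        k = [[v / c for v, c in zip(row, col_sums)] for row in k]
    return k


# Toy 2x2 similarity matrix: rows are queries, columns are items.
toy_similarity = [[0.9, 0.8], [0.85, 0.2]]
balanced = sinkhorn(toy_similarity)
```

In this toy matrix, query 0 nominally prefers item 0, but query 1 depends on item 0 far more strongly, so after normalization query 0 is matched to item 1. Resolving that kind of competition between queries is what this family of postprocessing is meant to do.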
Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, Graham Neubig
On-the-fly retrieval of relevant knowledge has proven an essential element of reliable systems for tasks such as open-domain question answering and fact verification. However, because retrieval systems are not perfect, generation models are required to generate outputs given partially or entirely irrelevant passages. This can cause over- or under-reliance on context, and result in problems in the generated output such as hallucinations. To alleviate these problems, we propose FILCO, a method that improves the quality of the context provided to the generator by (1) identifying useful context based on lexical and information-theoretic approaches, and (2) training context filtering models that can filter retrieved contexts at test time. We experiment on six knowledge-intensive tasks with FLAN-T5 and LLaMa2, and demonstrate that our method outperforms existing approaches on extractive question answering (QA), complex multi-hop and long-form QA, fact verification, and dialog generation tasks. FILCO effectively improves the quality of context, whether or not it supports the canonical output.
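To make the idea of lexically identifying useful context concrete, the sketch below scores each retrieved sentence against a reference string with unigram F1 and keeps the best one. Unigram F1 is used here as an assumed example of a lexical measure; the abstract does not name the specific measures, and the function names and toy sentences are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Harmonic mean of unigram precision and recall between two strings."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def select_context(sentences, reference):
    """Return the sentence with the highest unigram F1, or None if empty."""
    if not sentences:
        return None
    return max(sentences, key=lambda s: unigram_f1(s, reference))


# Toy retrieved passage split into sentences (hypothetical data).
best = select_context(
    ["Paris is the capital of France", "Bananas are yellow"],
    "capital of France",
)
```

Selections of this kind could serve as training targets for a filtering model that is then applied to retrieved contexts at test time, when no reference is available.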
Sai Muralidhar Jayanthi, Devang Kulshreshtha, Saket Dingliwal, Srikanth Ronanki, Sravan Bodapati
Personalization of automatic speech recognition (ASR) models is a widely
studied topic because of its many practical applications. Most recently,
attention-based contextual biasing techniques have been used to improve the
recognition of rare words and domain specific entities. However, due to
performance constraints, the biasing is often limited to a few thousand
entities, restricting real-world usability. To address this, we first propose a
"Retrieve and Copy" mechanism to improve latency while retaining the accuracy
even when scaled to a large catalog. We also propose a training strategy to
overcome the degradation in recall at such scale due to an increased number of
confusing entities. Overall, our approach achieves up to 6% more Word Error
Rate reduction (WERR) and 3.6% absolute improvement in F1 when compared to a
strong baseline. Our method also allows for large catalog sizes of up to 20K
without significantly affecting WER and F1-scores, while achieving at least 20%
inference speedup per acoustic frame.
Authors' comments: EMNLP 2023
Jingbiao Mei, Jinghong Chen, Weizhe Lin, Bill Byrne, Marcus Tomalin
Hateful memes have emerged as a significant concern on the Internet.
Detecting hateful memes requires the system to jointly understand the visual
and textual modalities. Our investigation reveals that the embedding space of
existing CLIP-based systems lacks sensitivity to subtle differences in memes
that are vital for correct hatefulness classification. We propose constructing
a hatefulness-aware embedding space through retrieval-guided contrastive
training. Our approach achieves state-of-the-art performance on the
HatefulMemes dataset with an AUROC of 87.0, outperforming much larger
fine-tuned large multimodal models. We demonstrate a retrieval-based hateful
memes detection system, which is capable of identifying hatefulness based on
data unseen in training. This allows developers to update the hateful memes
detection system by simply adding new examples without retraining, a desirable
feature for real services in the constantly evolving landscape of hateful memes
on the Internet.
Authors' comments: ACL 2024 Main. The code is available from:
https://github.com/JingbiaoMei/RGCL
Haoxin Li, Daniel Cheng, Phillip Keung, Jungo Kasai, Noah A. Smith
Generative retrieval (Wang et al., 2022; Tay et al., 2022) is a popular
approach for end-to-end document retrieval that directly generates document
identifiers given an input query. We introduce summarization-based document
IDs, in which each document's ID is composed of an extractive summary or
abstractive keyphrases generated by a language model, rather than an integer ID
sequence or bags of n-grams as proposed in past work. We find that abstractive,
content-based IDs (ACID) and an ID based on the first 30 tokens are very
effective in direct comparisons with previous approaches to ID creation. We
show that using ACID improves top-10 and top-20 recall by 15.6% and 14.4%
(relative) respectively versus the cluster-based integer ID baseline on the
MSMARCO 100k retrieval task, and 9.8% and 9.9% respectively on the
Wikipedia-based NQ 100k retrieval task. Our results demonstrate the
effectiveness of human-readable, natural-language IDs created through
summarization for generative retrieval. We also observed that extractive IDs
outperformed abstractive IDs on Wikipedia articles in NQ but not the snippets
in MSMARCO, which suggests that document characteristics affect generative
retrieval performance.
Authors' comments: To appear at the NLP for Wikipedia Workshop in EMNLP 2024
Felix den Breejen, Sangmin Bae, Stephen Cha, Tae-Young Kim, Seoung Hyun Koh, Se-Young Yun
While interest in tabular deep learning has grown significantly,
conventional tree-based models still outperform deep learning methods. To
narrow this performance gap, we explore a retrieval mechanism, a
methodology that allows neural networks to refer to other data points while
making predictions. Our experiments reveal that retrieval-based training,
especially when fine-tuning the pretrained TabPFN model, notably surpasses
existing methods. Moreover, extensive pretraining plays a crucial role in
enhancing the performance of the model. These insights imply that blending the
retrieval mechanism with pretraining and transfer learning schemes offers
considerable potential for advancing the field of tabular deep learning.
Authors' comments: Table Representation Learning Workshop at NeurIPS 2023
Jason Niu, Ilya D. Amburg, Sinan G. Aksoy, Ahmet Erdem Sarıyüce
Complex systems frequently exhibit multi-way, rather than pairwise, interactions. These group interactions cannot be faithfully modeled as collections of pairwise interactions using graphs and instead require hypergraphs. However, methods that analyze hypergraphs directly, rather than via lossy graph reductions, remain limited. Hypergraph motifs hold promise in this regard, as motif patterns serve as building blocks for larger group interactions which are inexpressible by graphs. Recent work has focused on categorizing and counting hypergraph motifs based on the existence of nodes in hyperedge intersection regions. Here, we argue that the relative sizes of hyperedge intersections within motifs contain varied and valuable information. We propose a suite of efficient algorithms for finding top-k triplets of hyperedges based on optimizing the sizes of these intersection patterns. This formulation uncovers interesting local patterns of interaction, finding hyperedge triplets that either (1) are the least similar with each other, (2) have the highest pairwise but not groupwise correlation, or (3) are the most similar with each other. We formalize this as a combinatorial optimization problem and design efficient algorithms based on filtering hyperedges. Our comprehensive experimental evaluation shows that the resulting hyperedge triplets yield insightful information on real-world hypergraphs. Our approach is also orders of magnitude faster than a naive baseline implementation.
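For orientation, a naive baseline of the kind the abstract compares against can be written as an exhaustive enumeration of hyperedge triplets. The sketch below scores a triplet by the size of its three-way intersection, an assumed stand-in for the "most similar" objective; the paper's actual objectives and filtering-based algorithms are not reproduced, and all names are illustrative.

```python
from heapq import nlargest
from itertools import combinations


def top_k_triplets(hyperedges, k):
    """Exhaustively score all hyperedge triplets by three-way intersection size.

    `hyperedges` maps a hyperedge name to its set of nodes. Returns the k
    best (score, triplet) pairs, largest first.
    """
    scored = (
        (len(hyperedges[a] & hyperedges[b] & hyperedges[c]), (a, b, c))
        for a, b, c in combinations(sorted(hyperedges), 3)
    )
    return nlargest(k, scored)


# Toy hypergraph (hypothetical).
toy_hyperedges = {
    "e1": {1, 2, 3},
    "e2": {2, 3, 4},
    "e3": {2, 3, 5},
    "e4": {9},
}
best_triplets = top_k_triplets(toy_hyperedges, k=1)
```

The enumeration visits every one of the O(m^3) triplets for m hyperedges, which is the cost that filtering-based algorithms aim to avoid.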
Xiaoqian Li, Ercong Nie, Sheng Liang
The remarkable ability of Large Language Models (LLMs) to understand and
follow instructions has sometimes been limited by their in-context learning
(ICL) performance in low-resource languages. To address this, we introduce a
novel approach that leverages cross-lingual retrieval-augmented in-context
learning (CREA-ICL). By extracting semantically similar prompts from
high-resource languages, we aim to improve the zero-shot performance of
multilingual pre-trained language models (MPLMs) across diverse tasks. Though
our approach yields steady improvements in classification tasks, it faces
challenges in generation tasks. Our evaluation offers insights into the
performance dynamics of retrieval-augmented in-context learning across both
classification and generation domains.
Authors' comments: In The Workshop on Instruction Tuning and Instruction Following, held
in conjunction with The Conference on NeurIPS 2023, December 2023
Xin Lu, Shikun Chen, Yichao Cao, Xin Zhou, Xiaobo Lu
In recent years, hashing methods have been popular in large-scale media search for their low storage cost and strong representation capabilities. To describe objects with similar overall appearance but subtle differences, more and more studies focus on hashing-based fine-grained image retrieval. Existing hashing networks usually generate both local and global features through attention guidance on the same deep activation tensor, which limits the diversity of feature representations. To handle this limitation, we substitute convolutional descriptors for attention-guided features and propose Attributes Grouping and Mining Hashing (AGMH), which groups and embeds the category-specific visual attributes in multiple descriptors to generate a comprehensive feature representation for efficient fine-grained image retrieval. Specifically, an Attention Dispersion Loss (ADL) is designed to force the descriptors to attend to various local regions and capture diverse subtle details. Moreover, we propose a Stepwise Interactive External Attention (SIEA) to mine critical attributes in each descriptor and construct correlations between fine-grained attributes and objects. The attention mechanism is dedicated to learning discrete attributes and incurs no additional computation during hash code generation. Finally, the compact binary codes are learned by preserving pairwise similarities. Experimental results demonstrate that AGMH consistently yields the best performance against state-of-the-art methods on fine-grained benchmark datasets.