Lukas Liehr
We study the determination of a holomorphic function from its absolute value.
Given a parameter $\theta \in \mathbb{R}$, we derive the following
characterization of uniqueness in terms of rigidity of a set $\Lambda \subseteq
\mathbb{R}$: if $\mathcal{F}$ is a vector space of entire functions containing
all exponentials $e^{\xi z}, \, \xi \in \mathbb{C} \setminus \{ 0 \}$, then
every $F \in \mathcal{F}$ is uniquely determined up to a unimodular phase
factor by $\{|F(z)| : z \in e^{i\theta}(\mathbb{R} + i\Lambda)\}$ if and only
if $\Lambda$ is not contained in an arithmetic progression $a\mathbb{Z}+b$.
Leveraging this insight, we establish a series of consequences for Gabor phase
retrieval and Pauli-type uniqueness problems. For instance, $\mathbb{Z} \times
\tilde{\mathbb{Z}}$ is a uniqueness set for the Gabor phase retrieval problem
in $L^2(\mathbb{R}_+)$, provided that $\tilde{\mathbb{Z}}$ is a suitable
perturbation of the integers.
Authors' comments: 14 pages
Yunqiu Shao, Haitao Li, Yueyue Wu, Yiqun Liu, Qingyao Ai, Jiaxin Mao, Yixiao Ma, Shaoping Ma
Legal case retrieval is a special Information Retrieval~(IR) task focusing on
legal case documents. Depending on the downstream tasks of the retrieved case
documents, users' information needs in legal case retrieval could be
significantly different from those in Web search and traditional ad-hoc
retrieval tasks. While there are several studies that retrieve legal cases
based on text similarity, the underlying search intents of legal retrieval
users, as shown in this paper, are more complicated than that yet mostly
unexplored. To this end, we present a novel hierarchical intent taxonomy of
legal case retrieval. It consists of five intent types categorized by three
criteria, i.e., search for Particular Case(s), Characterization, Penalty,
Procedure, and Interest. The taxonomy was constructed transparently and
evaluated extensively through interviews, editorial user studies, and query log
analysis. Through a laboratory user study, we reveal significant differences in
user behavior and satisfaction under different search intents in legal case
retrieval. Furthermore, we apply the proposed taxonomy to various downstream
legal retrieval tasks, e.g., result ranking and satisfaction prediction, and
demonstrate its effectiveness. Our work provides important insights into the
understanding of user intents in legal case retrieval and potentially leads to
better retrieval techniques in the legal domain, such as intent-aware ranking
strategies and evaluation methodologies.
Authors' comments: 28 pages, work in progress
Andrea Bacciu, Florin Cuconasu, Federico Siciliano, Fabrizio Silvestri, Nicola Tonellotto, Giovanni Trappolini
The emergence of large language models (LLMs) has revolutionized machine learning and related fields, showcasing remarkable abilities in comprehending, generating, and manipulating human language. However, their conventional usage through API-based text prompt submissions imposes limitations in terms of context constraints and external source availability. To address these challenges, we propose a novel framework called Reinforced Retrieval Augmented Machine Learning (RRAML). RRAML integrates the reasoning capabilities of LLMs with supporting information retrieved by a purpose-built retriever from a vast user-provided database. By leveraging recent advancements in reinforcement learning, our method addresses several critical challenges. First, it circumvents the need for accessing LLM gradients. Second, it alleviates the burden of retraining LLMs for specific tasks, which is often impractical or impossible due to restricted access to the model and the computational intensity involved. Additionally, we seamlessly link the retriever's task with the reasoner, mitigating hallucinations and reducing the retrieval of irrelevant and potentially damaging documents. We believe that the research agenda outlined in this paper has the potential to profoundly impact the field of AI, democratizing access to and utilization of LLMs for a wide range of entities.
Dan Edidin, Arun Suresh
In this paper we consider the problem of recovering a signal $x \in
\mathbb{R}^N$ from its power spectrum assuming that the signal is sparse with
respect to a generic basis for $\mathbb{R}^N$. Our main result is that if the
sparsity level is at most $\sim\! N/2$ in this basis then the generic sparse
vector is uniquely determined up to sign from its power spectrum. We also prove
that if the sparsity level is $\sim\! N/4$ then every sparse vector is
determined up to sign from its power spectrum. Analogous results are also
obtained for the power spectrum of a vector in $\mathbb{C}^N$ which extend
earlier results of Wang and Xu \cite{arXiv:1310.0873}.
Authors' comments: 20 pages
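The sign ambiguity mentioned in the abstract above (Edidin and Suresh) can be checked numerically. The following is a minimal sketch, not the authors' method: it uses a direct O(N^2) discrete Fourier transform from the Python standard library, and the example vector `signal` is an arbitrary assumption chosen for illustration. It only shows that $x$ and $-x$ share the same power spectrum, which is why uniqueness can hold at best up to sign.

```python
import cmath


def power_spectrum(x):
    """Return |DFT(x)[k]|^2 for k = 0..N-1 using a direct O(N^2) DFT."""
    n = len(x)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# Illustrative vector (an assumption, not data from the paper).
signal = [1.0, 0.0, -2.0, 0.0, 0.5, 0.0]
flipped = [-v for v in signal]

# The power spectra of x and -x agree up to floating-point error.
same = all(
    abs(a - b) < 1e-9
    for a, b in zip(power_spectrum(signal), power_spectrum(flipped))
)
print(same)
```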
L. Siddharth, Jianxi Luo
Aiming to support Retrieval Augmented Generation (RAG) in the design process,
we present a method to identify explicit, engineering design facts - {head
entity :: relationship :: tail entity} from patented artefact descriptions.
Given a sentence with a pair of entities (based on noun phrases) marked in a
unique manner, our method extracts the relationship that is explicitly
communicated in the sentence. For this task, we create a dataset of 375,084
examples and fine-tune language models for relation identification (token
classification) and elicitation (sequence-to-sequence). The token
classification approach achieves up to 99.7 % accuracy. Upon applying the
method to a domain of 4,870 fan system patents, we populate a knowledge base of
over 2.93 million facts. Using this knowledge base, we demonstrate how Large
Language Models (LLMs) are guided by explicit facts to synthesise knowledge and
generate technical and cohesive responses when sought out for knowledge
retrieval tasks in the design process.
Authors' comments: Resources: Dataset -
https://huggingface.co/datasets/siddharthl1293/engineering_design_facts
Training Infrastructure - https://zenodo.org/records/12012131 Trained model -
https://huggingface.co/siddharthl1293/albert-albert-large-v2 Application -
https://github.com/siddharthl93/engineering-design-knowledge
Brunello Tirozzi, Orchidea Maria Lecian
A phoneme-retrieval technique is proposed, which follows from the particular way
the network is constructed. An initial set of neurons is given. The
number of these neurons is approximately equal to the number of typical
structures of the data. For example if the network is built for voice retrieval
then the number of neurons must be equal to the number of characteristic
phonemes of the alphabet of the language spoken by the social group to which
the particular person belongs. Usually this task is very complicated and the
network can depend critically on the samples used for the learning. If the
network is built for image retrieval then it works only if the data to be
retrieved belong to a particular set of images. If the network is built for
voice recognition it works only for some particular set of words. A typical
example is the words used for the flight of airplanes. For example, a command
like "the airplane should make a turn of 120 degrees towards the east" can be
easily recognized by the network if a suitable learning procedure is used.
Authors' comments: 10 pages
Helia Hashemi, Yong Zhuang, Sachith Sri Ram Kothur, Srivas Prasad, Edgar Meij, W. Bruce Croft
In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution is different from the source domain. Existing methods in this area focus on unsupervised domain adaptation where they have access to the target document collection or supervised (often few-shot) domain adaptation where they additionally have access to (limited) labeled data in the target domain. There also exists research on improving zero-shot performance of retrieval models with no adaptation. This paper introduces a new category of domain adaptation in IR that is as-yet unexplored. Here, similar to the zero-shot setting, we assume the retrieval model does not have access to the target document collection. In contrast, it does have access to a brief textual description that explains the target domain. We define a taxonomy of domain attributes in retrieval tasks to understand different properties of a source domain that can be adapted to a target domain. We introduce a novel automatic data construction pipeline that produces a synthetic document collection, query set, and pseudo relevance labels, given a textual domain description. Extensive experiments on five diverse target domains show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.
Palina Salanevich
In many signal processing problems arising in practical applications, we wish to reconstruct an unknown signal from its phaseless measurements with respect to a frame. This inverse problem is known as the phase retrieval problem. For each particular application, the set of relevant measurement frames is determined by the problem at hand, which motivates the study of phase retrieval for structured, application-relevant frames. In this paper, we focus on one class of such frames that appear naturally in diffraction imaging, ptychography, and audio processing, namely, multi-window Gabor frames. We study the question of injectivity of the phase retrieval problem with these measurement frames in the finite-dimensional setup and propose an explicit construction of an infinite family of phase retrievable multi-window Gabor frames. We show that phase retrievability for the constructed frames can be achieved with a much smaller number of phaseless measurements compared to the previous results for this type of measurement frames. Additionally, we show that the number of phaseless measurements sufficient for reconstruction depends on the dimension of the signal space, and not on the ambient dimension of the problem.
Wenzheng Zhang, Chenyan Xiong, Karl Stratos, Arnold Overwijk
In multitask retrieval, a single retriever is trained to retrieve relevant
contexts for multiple tasks. Despite its practical appeal, naive multitask
retrieval lags behind task-specific retrieval in which a separate retriever is
trained for each task. We show that it is possible to train a multitask
retriever that outperforms task-specific retrievers by promoting task
specialization. The main ingredients are: (1) a better choice of pretrained
model (one that is explicitly optimized for multitasking) along with compatible
prompting, and (2) a novel adaptive learning method that encourages each
parameter to specialize in a particular task. The resulting multitask retriever
is highly performant on the KILT benchmark. Upon analysis, we find that the
model indeed learns parameters that are more task-specialized compared to naive
multitasking without prompting or adaptive learning.
Authors' comments: TACL 2023
Iain Mackie, Shubham Chatterjee, Sean MacAvaney, Jeffrey Dalton
Despite considerable progress in neural relevance ranking techniques, search engines still struggle to process complex queries effectively - both in terms of precision and recall. Sparse and dense Pseudo-Relevance Feedback (PRF) approaches have the potential to overcome limitations in recall, but are only effective with high precision in the top ranks. In this work, we tackle the problem of search over complex queries using three complementary techniques. First, we demonstrate that applying a strong neural re-ranker before sparse or dense PRF can improve the retrieval effectiveness by 5-8%. This improvement in PRF effectiveness can be attributed directly to improving the precision of the feedback set. Second, we propose an enhanced expansion model, Latent Entity Expansion (LEE), which applies fine-grained word and entity-based relevance modelling incorporating localized features. Specifically, we find that including both words and entities for expansion achieves a further 2-8% improvement in NDCG. Our analysis also demonstrates that LEE is largely robust to its parameters across datasets and performs well on entity-centric queries. Third, we include an 'adaptive' component in the retrieval process, which iteratively refines the re-ranking pool during scoring using the expansion model and avoids re-ranking additional documents. We find that this combination of techniques achieves the best NDCG, MAP and R@1000 results on the TREC Robust 2004 and CODEC document datasets, demonstrating a significant advancement in expansion effectiveness.
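The first technique in the abstract above (Mackie et al.), re-ranking before pseudo-relevance feedback, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's LEE model: `rerank_score` is a hypothetical callable standing in for a neural re-ranker, documents are plain whitespace-tokenized strings, and expansion terms are chosen by raw term frequency in the feedback set.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, rerank_score, k=2, n_terms=3):
    """Re-rank first-pass results, then pick expansion terms from the top k.

    rerank_score is a hypothetical stand-in for a neural re-ranker: it maps
    a document string to a relevance score for the current query.
    """
    # Re-rank so that the feedback set has higher precision.
    reranked = sorted(ranked_docs, key=rerank_score, reverse=True)
    feedback = reranked[:k]

    # Count terms in the feedback documents, skipping the original query terms.
    counts = Counter(
        term
        for doc in feedback
        for term in doc.split()
        if term not in query_terms
    )
    expansion = [term for term, _ in counts.most_common(n_terms)]
    return list(query_terms) + expansion


docs = [
    "solar panel efficiency solar",
    "cooking pasta recipe",
    "solar cell efficiency",
]
# Toy scorer (an assumption): count occurrences of the query term.
expanded = expand_query(["solar"], docs, lambda d: d.count("solar"), n_terms=1)
print(expanded)
```

Because the off-topic document is pushed out of the feedback set by the re-ranker, none of its terms can enter the expanded query.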
William Yang, Noah Bergam, Arnav Jain, Nima Sheikhoslami
In this paper, we consider the extent to which the transformer-based Dense Passage Retrieval (DPR) algorithm, developed by Karpukhin et al. (2020), can be optimized without further pre-training. Our method involves two particular insights: we apply the DPR context encoder at various phrase lengths (e.g. one-sentence versus five-sentence segments), and we take a confidence-calibrated ensemble prediction over all of these different segmentations. This somewhat exhaustive approach achieves state-of-the-art results on benchmark datasets such as Google NQ and SQuAD. We also apply our method to domain-specific datasets, and the results suggest how different granularities are optimal for different domains.
Yongqi Li, Nan Yang, Liang Wang, Furu Wei, Wenjie Li
Generative retrieval stands out as a promising new paradigm in text retrieval
that aims to generate identifier strings of relevant passages as the retrieval
target. This generative paradigm taps into powerful generative language models,
distinct from traditional sparse or dense retrieval methods. However, only
learning to generate is insufficient for generative retrieval. Generative
retrieval learns to generate identifiers of relevant passages as an
intermediate goal and then converts predicted identifiers into the final
passage rank list. The disconnect between the learning objective of
autoregressive models and the desired passage ranking target leads to a
learning gap. To bridge this gap, we propose a learning-to-rank framework for
generative retrieval, dubbed LTRGR. LTRGR enables generative retrieval to learn
to rank passages directly, optimizing the autoregressive model toward the final
passage ranking target via a rank loss. This framework only requires an
additional learning-to-rank training phase to enhance current generative
retrieval systems and does not add any burden to the inference stage. We
conducted experiments on three public benchmarks, and the results demonstrate
that LTRGR achieves state-of-the-art performance among generative retrieval
methods. The code and checkpoints are released at
https://github.com/liyongqi67/LTRGR.
Authors' comments: AAAI 2024
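The rank loss mentioned in the abstract above (Li et al.) is not specified there. As a hedged illustration only, the sketch below uses a standard margin-based pairwise ranking loss, which is one common choice in learning to rank; the function names, the `margin` default, and the example scores are all assumptions.

```python
def margin_rank_loss(pos_score, neg_score, margin=1.0):
    """Pairwise hinge loss: zero once the relevant passage outscores the
    irrelevant one by at least `margin`."""
    return max(0.0, margin - (pos_score - neg_score))


def batch_rank_loss(pairs, margin=1.0):
    """Mean pairwise loss over (pos_score, neg_score) pairs."""
    if not pairs:
        return 0.0
    return sum(margin_rank_loss(p, n, margin) for p, n in pairs) / len(pairs)


# Hypothetical passage scores, e.g. aggregated from generated identifiers.
pairs = [(3.0, 1.0), (1.0, 1.5)]
print(batch_rank_loss(pairs))
```

The first pair is already separated by more than the margin and contributes nothing; the second is mis-ordered and is penalized.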
Erik Malm
Phase retrieval is the numerical procedure of recovering a complex-valued signal from knowledge about its amplitude and some additional information. Here, an indirect registration procedure, based on the large deformation diffeomorphic metric mapping (LDDMM) formalism, is investigated as a phase retrieval method for coherent diffractive imaging. The method attempts to find a deformation which transforms an initial template image to match an unknown target image by comparing the diffraction pattern to the data. The exterior calculus framework is used to treat different types of deformations in a unified and coordinate-free way. The algorithm's performance with respect to measurement noise, image topology, and particular action is explored through numerical examples.
Siqing Huo, Negar Arabzadeh, Charles L. A. Clarke
Current large language models (LLMs) can exhibit near-human levels of performance on many natural language tasks, including open-domain question answering. Unfortunately, they also convincingly hallucinate incorrect answers, so that responses to questions must be verified against external sources before they can be accepted at face value. In this paper, we report a simple experiment to automatically verify generated answers against a corpus. After presenting a question to an LLM and receiving a generated answer, we query the corpus with the combination of the question + generated answer. We then present the LLM with the combination of the question + generated answer + retrieved answer, prompting it to indicate if the generated answer can be supported by the retrieved answer. We base our experiment on questions and passages from the MS MARCO (V1) test collection, exploring three retrieval approaches ranging from standard BM25 to a full question answering stack, including a reader based on the LLM. For a large fraction of questions, we find that an LLM is capable of verifying its generated answer if appropriate supporting material is provided. However, with an accuracy of 70-80%, this approach cannot be fully relied upon to detect hallucinations.
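The three-step experiment described in the abstract above (Huo et al.) can be sketched as a small pipeline. This is an assumption-laden outline, not the authors' code: `generate`, `retrieve`, and `judge` are hypothetical callables standing in for the LLM, the corpus search, and the LLM-based verification step, and the prompt wording is invented for illustration.

```python
def verify_answer(question, generate, retrieve, judge):
    """Generate an answer, retrieve support for it, and ask for a verdict.

    generate, retrieve, and judge are hypothetical callables:
      generate(question) -> answer text
      retrieve(query)    -> retrieved passage text
      judge(prompt)      -> free-text reply starting with "yes" or "no"
    """
    answer = generate(question)
    # Query the corpus with the question combined with the generated answer.
    passage = retrieve(question + " " + answer)
    prompt = (
        f"Question: {question}\n"
        f"Generated answer: {answer}\n"
        f"Retrieved passage: {passage}\n"
        "Is the generated answer supported by the retrieved passage? "
        "Answer yes or no."
    )
    verdict = judge(prompt).strip().lower()
    return {
        "answer": answer,
        "passage": passage,
        "supported": verdict.startswith("yes"),
    }


# Stub components (assumptions) so the sketch runs without any model or index.
result = verify_answer(
    "What is the capital of France?",
    generate=lambda q: "Paris",
    retrieve=lambda q: "Paris is the capital and largest city of France.",
    judge=lambda p: "Yes.",
)
print(result["supported"])
```

Given the 70-80% accuracy reported in the abstract, the `supported` flag from such a pipeline should be treated as a signal rather than a guarantee.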
Michael Glass, Xueqing Wu, Ankita Rajaram Naik, Gaetano Rossiello, Alfio Gliozzo
Data preparation, also called data wrangling, is considered one of the most
expensive and time-consuming steps when performing analytics or building
machine learning models. Preparing data typically involves collecting and
merging data from complex, heterogeneous, and often large-scale data sources,
such as data lakes. In this paper, we introduce a novel approach toward
automatic data wrangling in an attempt to alleviate the effort of end-users,
e.g. data analysts, in structuring dynamic views from data lakes in the form of
tabular data. We aim to address table augmentation tasks, including row/column
population and data imputation. Given a corpus of tables, we propose a
retrieval augmented self-trained transformer model. Our self-learning strategy
consists in randomly ablating tables from the corpus and training the
retrieval-based model to reconstruct the original values or headers given the
partial tables as input. We adopt this strategy to first train the dense neural
retrieval model encoding table-parts to vectors, and then the end-to-end model
trained to perform table augmentation tasks. We test on EntiTables, the
standard benchmark for table augmentation, as well as introduce a new benchmark
to advance further research: WebTables. Our model consistently and
substantially outperforms both supervised statistical methods and the current
state-of-the-art transformer-based models.
Authors' comments: Findings of ACL 2023
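The self-learning strategy in the abstract above (Glass et al.), which ablates part of a table and trains a model to reconstruct it, can be illustrated by how a single training example might be built. This is a minimal sketch under assumptions: tables are lists of rows, only one cell is ablated, and `ablate_cell` is a hypothetical helper rather than anything from the paper.

```python
import random


def ablate_cell(table, rng):
    """Blank out one randomly chosen cell.

    Returns (partial_table, (row, col), original_value); the input table is
    left unmodified so it can be reused for further examples.
    """
    row = rng.randrange(len(table))
    col = rng.randrange(len(table[row]))
    partial = [list(r) for r in table]  # copy each row before editing
    target = partial[row][col]
    partial[row][col] = None  # None marks the value to be reconstructed
    return partial, (row, col), target


# Illustrative table (an assumption, not data from the paper).
table = [["Paris", "France"], ["Rome", "Italy"], ["Madrid", "Spain"]]
partial, (r, c), target = ablate_cell(table, random.Random(0))
print(partial, target)
```

The pair (`partial`, `target`) then plays the role of input and label for a data-imputation style objective.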
Panuthep Tasawong, Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, Sarana Nutanong
Dense retrieval is a basic building block of information retrieval
applications. One of the main challenges of dense retrieval in real-world
settings is the handling of queries containing misspelled words. A popular
approach for handling misspelled queries is minimizing the representation
discrepancy between misspelled queries and their pristine ones. Unlike the
existing approaches, which only focus on the alignment between misspelled and
pristine queries, our method also improves the contrast between each misspelled
query and its surrounding queries. To assess the effectiveness of our proposed
method, we compare it against the existing competitors using two benchmark
datasets and two base encoders. Our method outperforms the competitors in all
cases with misspelled queries. Our code and models are available at
https://github.com/panuthept/DST-DenseRetrieval.
Authors' comments: 5 pages, 2 figures
Sijie Zhao, Yixiao Ge, Zhongang Qi, Lin Song, Xiaohan Ding, Zehua Xie, Ying Shan
Stickers have become a ubiquitous part of modern-day communication, conveying complex emotions through visual imagery. To facilitate the development of more powerful algorithms for analyzing stickers, we propose a large-scale Chinese sticker dataset, namely Sticker820K, which consists of 820k image-text pairs. Each sticker has rich and high-quality textual annotations, including descriptions, optical characters, emotional labels, and style classifications. Although vision-language tasks in the domain of natural images have been well studied, directly applying those models, such as CLIP, to sticker data is not an optimal solution due to the discrepancy between natural and emotive image data. Therefore, we propose StickerCLIP as a benchmark model on the Sticker820K dataset. For the text-to-image retrieval task, StickerCLIP substantially outperforms CLIP, achieving an absolute gain of 66.0\% in mean recall on the Sticker820K test set. Additionally, we endeavor to extend the recently popularized LLM by means of prompt tuning, integrating its ability for sticker retrieval and allowing users to retrieve stickers through instructions. We validate the feasibility of this method, demonstrating the immense potential of prompt tuning in expanding LLM abilities while not affecting the quality of upstream tasks.
Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from the pre-training dataset. Hence, a key ingredient to their success has been the use of large-scale curated pre-training data aiming at expanding the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example +10.9 on Stanford Cars, +10.2 on CUB-2011 and +7.3 on the recent OVEN benchmark, where we even outperform the fine-tuned models on unseen classes.
Yikun Liu, Jiangchao Yao, Ya Zhang, Yanfeng Wang, Weidi Xie
In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability. We make the following contributions: (i) we initiate a scalable pipeline to automatically construct datasets for training CIR models, by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure and the effectiveness of core components in TransAgg; (iv) when evaluated on the publicly available benchmarks under the zero-shot scenario, i.e., training on the automatically constructed datasets and then directly conducting inference on target downstream datasets, e.g., CIRR and FashionIQ, our proposed approach either performs on par with or significantly outperforms the existing state-of-the-art (SOTA) models. Project page: https://code-kunkun.github.io/ZS-CIR/
Shufang Xie, Rui Yan, Junliang Guo, Yingce Xia, Lijun Wu, Tao Qin
Retrosynthesis, which predicts the reactants of a given target molecule, is
an essential task for drug discovery. In recent years, machine-learning-based
retrosynthesis methods have achieved promising results. In this work, we
introduce RetroKNN, a local reaction template retrieval method to further boost
the performance of template-based systems with non-parametric retrieval. We
first build an atom-template store and a bond-template store that contain the
local templates in the training data, then retrieve from these templates with a
k-nearest-neighbor (KNN) search during inference. The retrieved templates are
combined with neural network predictions as the final output. Furthermore, we
propose a lightweight adapter to adjust the weights when combining neural network
and KNN predictions conditioned on the hidden representation and the retrieved
templates. We conduct comprehensive experiments on two widely used benchmarks,
the USPTO-50K and USPTO-MIT. In particular, we improve top-1 accuracy by
7.1% on the USPTO-50K dataset and 12.0% on the USPTO-MIT dataset. These results
demonstrate the effectiveness of our method.
Authors' comments: AAAI-2023 camera ready
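The combination of neural network and KNN predictions described in the abstract above (Xie et al.) can be sketched as a simple interpolation. This is an illustration under assumptions, not RetroKNN itself: in the paper a lightweight adapter predicts the combination weights, whereas here the weight `lam` and the `temperature` are fixed, hand-chosen values, and the template names and distances are invented.

```python
import math


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (template_id, distance) pairs into a distribution.

    Uses a softmax over negative distances and sums the mass of neighbors
    that share a template. Expects a non-empty neighbor list.
    """
    weights = [math.exp(-dist / temperature) for _, dist in neighbors]
    total = sum(weights)
    probs = {}
    for (template_id, _), weight in zip(neighbors, weights):
        probs[template_id] = probs.get(template_id, 0.0) + weight / total
    return probs


def interpolate(p_model, p_knn, lam):
    """Mix model and KNN probabilities: (1 - lam) * model + lam * knn."""
    keys = set(p_model) | set(p_knn)
    return {
        k: (1.0 - lam) * p_model.get(k, 0.0) + lam * p_knn.get(k, 0.0)
        for k in keys
    }


# Hypothetical template probabilities and retrieved neighbors.
p_model = {"T1": 0.6, "T2": 0.4}
neighbors = [("T2", 0.1), ("T2", 0.3), ("T3", 1.5)]
p_final = interpolate(p_model, knn_distribution(neighbors), lam=0.5)
print(max(p_final, key=p_final.get))
```

In this toy case the retrieved neighbors shift the top prediction from `T1` to `T2`, which is the kind of correction non-parametric retrieval is meant to provide.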