Zhangyin Feng, Xiaocheng Feng, Dezhi Zhao, Maojin Yang, Bing Qin
Large language models augmented with task-relevant documents have demonstrated impressive performance on knowledge-intensive tasks. However, regarding how to obtain effective documents, the existing methods are mainly divided into two categories. One is to retrieve from an external knowledge base, and the other is to utilize large language models to generate documents. We propose an iterative retrieval-generation collaborative framework. It is not only able to leverage both parametric and non-parametric knowledge, but also helps to find the correct reasoning path through retrieval-generation interactions, which is very important for tasks that require multi-step reasoning. We conduct experiments on four question answering datasets, including single-hop QA and multi-hop QA tasks. Empirical results show that our method significantly improves the reasoning ability of large language models and outperforms previous baselines.
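The retrieval-generation interaction described in this abstract can be pictured as a loop in which each round's generated text is fed back into the next retrieval query. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy corpus, the word-overlap retriever, and the concatenating "generator" are all hypothetical stand-ins for a real retriever and a real LLM.

```python
import re
from typing import Callable, List, Set, Tuple

# Hypothetical toy corpus; a real system would query an external knowledge base.
CORPUS = [
    "Hamlet was written by Shakespeare.",
    "Shakespeare was born in Stratford.",
    "Mozart composed many operas.",
]
QUESTION = "Which town is the birthplace of the author of Hamlet?"


def tokens(text: str) -> Set[str]:
    """Lowercased word set used by the toy retriever."""
    return set(re.findall(r"\w+", text.lower()))


def toy_retrieve(query: str) -> List[str]:
    """Rank corpus documents by word overlap with the query (assumed retriever)."""
    q = tokens(query)
    scored = [(len(q & tokens(doc)), doc) for doc in CORPUS]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [doc for _, doc in ranked]


def toy_generate(question: str, evidence: List[str]) -> str:
    """Stand-in for an LLM call: simply restates the collected evidence."""
    return " ".join(evidence)


def iterative_retrieve_generate(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    rounds: int = 2,
) -> Tuple[str, List[str]]:
    """Alternate retrieval and generation; each draft expands the next query."""
    evidence: List[str] = []
    draft = ""
    for _ in range(rounds):
        query = f"{question} {draft}".strip()
        # Add the best-ranked document that has not been collected yet.
        for doc in retrieve(query):
            if doc not in evidence:
                evidence.append(doc)
                break
        draft = generate(question, evidence)
    return draft, evidence


draft, evidence = iterative_retrieve_generate(QUESTION, toy_retrieve, toy_generate)
print(evidence)
```

In this toy setting the second document shares no words with the question, so it only becomes retrievable once the first round's draft mentions "Shakespeare", which is the kind of multi-step dependency the abstract refers to.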
Stephen Choi, William Gazeley, Siu Ho Wong, Tingting Li
This paper introduces the Conversational Factor Information Retrieval Method
(ConFIRM), a novel approach to fine-tuning large language models (LLMs) for
domain-specific retrieval tasks. ConFIRM leverages the Five-Factor Model of
personality to generate synthetic datasets that accurately reflect target
population characteristics, addressing data scarcity in specialized domains. We
demonstrate ConFIRM's effectiveness through a case study in the finance sector,
fine-tuning a Llama-2-7b model using personality-aligned data from the
PolyU-Asklora Fintech Adoption Index. The resulting model achieved 91% accuracy
in classifying financial queries, with an average inference time of 0.61
seconds on an NVIDIA A100 GPU. ConFIRM shows promise for creating more accurate
and personalized AI-driven information retrieval systems across various
domains, potentially mitigating issues of hallucinations and outdated
information in LLMs deployed
Authors' comments: 8 pages, 2 figures, 2 tables, 2 appendices
Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina et al.
Extending the context window of large language models (LLMs) has recently
become popular, while the solution of augmenting LLMs with retrieval has
existed for years. The natural questions are: i) Retrieval-augmentation versus
long context window, which one is better for downstream tasks? ii) Can both
methods be combined to get the best of both worlds? In this work, we answer
these questions by studying both solutions using two state-of-the-art
pretrained LLMs, i.e., a proprietary 43B GPT and Llama2-70B. Perhaps
surprisingly, we find that an LLM with a 4K context window using simple
retrieval-augmentation at generation can achieve comparable performance to
a finetuned LLM with a 16K context window via positional interpolation on long
context tasks, while taking much less computation. More importantly, we
demonstrate that retrieval can significantly improve the performance of LLMs
regardless of their extended context window sizes. Our best model,
retrieval-augmented Llama2-70B with 32K context window, outperforms
GPT-3.5-turbo-16k and Davinci003 in terms of average score on nine long context
tasks including question answering, query-based summarization, and in-context
few-shot learning tasks. It also outperforms its non-retrieval Llama2-70B-32k
baseline by a margin, while being much faster at generation. Our study provides
general insights on the choice of retrieval-augmentation versus long context
extension of LLM for practitioners.
Authors' comments: Published at ICLR 2024
Brian Stern, Haoshuo Chen, Kwangwoong Kim, Hanzi Huang, Jie Zhao, Mohamad Hossein Idjadi
We demonstrate a direct-detection phase retrieval receiver based on silicon
photonics. The receiver implements strong dispersion and delay lines on a
compact chip. We retrieve the full field of a 30-GBd QPSK signal without a
carrier or local oscillator.
Authors' comments: 4 pages, 6 figures, at European Conference on Optical Communications
(ECOC) 2023
Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn et al.
Retrieval-augmented language models (RALMs) improve performance by accessing
long-tail and up-to-date knowledge from external data stores, but are
challenging to build. Existing approaches require either expensive
retrieval-specific modifications to LM pre-training or use post-hoc integration
of the data store that leads to suboptimal performance. We introduce
Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning
methodology that provides a third option by retrofitting any LLM with retrieval
capabilities. Our approach operates in two distinct fine-tuning steps: (1) one
updates a pre-trained LM to better use retrieved information, while (2) the
other updates the retriever to return more relevant results, as preferred by
the LM. By fine-tuning over tasks that require both knowledge utilization and
contextual awareness, we demonstrate that each stage yields significant
performance improvements, and using both leads to additional gains. Our best
model, RA-DIT 65B, achieves state-of-the-art performance across a range of
knowledge-intensive zero- and few-shot learning benchmarks, significantly
outperforming existing in-context RALM approaches by up to +8.9% in 0-shot
setting and +1.4% in 5-shot setting on average.
Authors' comments: 24 pages
Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert
We introduce Ragas (Retrieval Augmented Generation Assessment), a framework
for reference-free evaluation of Retrieval Augmented Generation (RAG)
pipelines. RAG systems are composed of a retrieval and an LLM based generation
module, and provide LLMs with knowledge from a reference textual database,
which enables them to act as a natural language layer between a user and
textual databases, reducing the risk of hallucinations. Evaluating RAG
architectures is, however, challenging because there are several dimensions to
consider: the ability of the retrieval system to identify relevant and focused
context passages, the ability of the LLM to exploit such passages in a faithful
way, or the quality of the generation itself. With Ragas, we put forward a
suite of metrics which can be used to evaluate these different dimensions
without having to rely on ground truth human annotations. We posit
that such a framework can crucially contribute to faster evaluation cycles of
RAG architectures, which is especially important given the fast adoption of
LLMs.
Authors' comments: Reference-free (not tied to having ground truth available) evaluation
framework for retrieval augmented generation
Siqing Huo, Negar Arabzadeh, Charles L. A. Clarke
Current large language models (LLMs) can exhibit near-human levels of
performance on many natural language-based tasks, including open-domain
question answering. Unfortunately, at this time, they also convincingly
hallucinate incorrect answers, so that responses to questions must be verified
against external sources before they can be accepted at face value. In this
paper, we report two simple experiments to automatically validate generated
answers against a corpus. We base our experiments on questions and passages
from the MS MARCO (V1) test collection, and a retrieval pipeline consisting of
sparse retrieval, dense retrieval and neural rerankers. In the first
experiment, we validate the generated answer in its entirety. After presenting
a question to an LLM and receiving a generated answer, we query the corpus with
the combination of the question + generated answer. We then present the LLM
with the combination of the question + generated answer + retrieved answer,
prompting it to indicate if the generated answer can be supported by the
retrieved answer. In the second experiment, we consider the generated answer at
a more granular level, prompting the LLM to extract a list of factual
statements from the answer and verifying each statement separately. We query
the corpus with each factual statement and then present the LLM with the
statement and the corresponding retrieved evidence. The LLM is prompted to
indicate if the statement can be supported and make necessary edits using the
retrieved material. With an accuracy of over 80%, we find that an LLM is
capable of verifying its generated answer when a corpus of supporting material
is provided. However, manual assessment of a random sample of questions reveals
that incorrect generated answers are missed by this verification process. While
this verification process can reduce hallucinations, it cannot entirely
eliminate them.
Authors' comments: arXiv admin note: text overlap with arXiv:2306.13781
Xingyu Yang, Daqing Liu, Heng Zhang, Yong Luo, Chaoyue Wang, Jing Zhang
Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text on how to shift from the starting point to the desired target image. However, most existing methods focus on the composition learning of text and reference images and oversimplify the text as a description, neglecting the inherent structure and the user's shifting intention of the texts. As a result, these methods typically take shortcuts that disregard the visual cue of the reference images. To address this issue, we reconsider the text as instructions and propose a Semantic Shift network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype and from the visual prototype to the target image. Specifically, SSN explicitly decomposes the instructions into two components: degradation and upgradation, where the degradation is used to picture the visual prototype from the reference image, while the upgradation is used to enrich the visual prototype into the final representations to retrieve the desired target image. The experimental results show that the proposed SSN demonstrates a significant improvement of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, and establishes a new state-of-the-art performance. Codes will be publicly available.
Elias Ramzi, Nicolas Audebert, Clément Rambour, André Araujo, Xavier Bitot, Nicolas Thome
In image retrieval, standard evaluation metrics rely on score ranking, e.g.,
average precision (AP), recall at k (R@k), normalized discounted cumulative
gain (NDCG). In this work we introduce a general framework for robust and
decomposable rank losses optimization. It addresses two major challenges for
end-to-end training of deep neural networks with rank losses:
non-differentiability and non-decomposability. Firstly we propose a general
surrogate for ranking operator, SupRank, that is amenable to stochastic
gradient descent. It provides an upper bound for rank losses and ensures robust
training. Secondly, we use a simple yet effective loss function to reduce the
decomposability gap between the averaged batch approximation of ranking losses
and their values on the whole training set. We apply our framework to two
standard metrics for image retrieval: AP and R@k. Additionally we apply our
framework to hierarchical image retrieval. We introduce an extension of AP, the
hierarchical average precision $\mathcal{H}$-AP, and optimize it as well as the
NDCG. Finally we create the first hierarchical landmarks retrieval dataset. We
use a semi-automatic pipeline to create hierarchical labels, extending the
large scale Google Landmarks v2 dataset. The hierarchical dataset is publicly
available at https://github.com/cvdfoundation/google-landmark. Code will be
released at https://github.com/elias-ramzi/SupRank.
Authors' comments: arXiv admin note: text overlap with arXiv:2207.04873
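As a reminder of what the two standard metrics named in this abstract measure, the sketch below computes AP and R@k for one ranked list of binary relevance labels. It is an illustrative sketch only and is not the SupRank surrogate: the function names are hypothetical, AP is normalized by the number of relevant items present in the list, and R@k is taken as the fraction of relevant items found in the top k (some image retrieval work instead uses a hit-or-miss variant). Both values depend on the sort order of the scores, which is why the exact metrics are not differentiable.

```python
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """AP for a ranked list of binary labels (1 = relevant), best-ranked first."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item
    return precision_sum / hits if hits else 0.0


def recall_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of all relevant items that appear in the top k positions."""
    total = sum(relevance)
    return sum(relevance[:k]) / total if total else 0.0


ranked_labels = [1, 0, 1, 0]  # hypothetical ranking: relevant items at ranks 1 and 3
print(average_precision(ranked_labels))
print(recall_at_k(ranked_labels, 2))
```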
Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
Despite recent progress in text-to-audio (TTA) generation, we show that the
state-of-the-art models, such as AudioLDM, trained on datasets with an
imbalanced class distribution, such as AudioCaps, are biased in their
generation performance. Specifically, they excel in generating common audio
classes while underperforming in the rare ones, thus degrading the overall
generation performance. We refer to this problem as long-tailed text-to-audio
generation. To address this issue, we propose a simple retrieval-augmented
approach for TTA models. Specifically, given an input text prompt, we first
leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve
relevant text-audio pairs. The features of the retrieved audio-text data are
then used as additional conditions to guide the learning of TTA models. We
enhance AudioLDM with our proposed approach and denote the resulting augmented
system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a
state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the
existing approaches by a large margin. Furthermore, we show that Re-AudioLDM
can generate realistic audio for complex scenes, rare audio classes, and even
unseen audio types, indicating its potential in TTA tasks.
Authors' comments: Accepted by ICASSP 2024
Xintong Jiang, Yaxiong Wang, Yujiao Wu, Meng Wang, Xueming Qian
Composed image retrieval, a task involving the search for a target image
using a reference image and a complementary text as the query, has witnessed
significant advancements owing to the progress made in cross-modal modeling.
Unlike the general image-text retrieval problem with only one alignment
relation, i.e., image-text, we argue for the existence of two types of
relations in composed image retrieval. The explicit relation pertains to the
reference image & complementary text-target image, which is commonly exploited
by existing methods. Besides this intuitive relation, the observations during
our practice have uncovered another implicit yet crucial relation, i.e.,
reference image & target image-complementary text, since we found that the
complementary text can be inferred by studying the relation between the target
image and the reference image. Regrettably, existing methods largely focus on
leveraging the explicit relation to learn their networks, while overlooking the
implicit relation. In response to this weakness, we propose a new framework for
composed image retrieval, termed dual relation alignment, which integrates both
explicit and implicit relations to fully exploit the correlations among the
triplets. Specifically, we design a vision compositor to fuse the reference image
and the target image first; the resulting representation then serves two
roles: (1) counterpart for semantic alignment with the complementary text and
(2) compensation for the complementary text to boost the explicit relation
modeling, thereby implanting the implicit relation into the alignment learning.
Our method is evaluated on two popular datasets, CIRR and FashionIQ, through
extensive experiments. The results confirm the effectiveness of our
dual-relation learning in substantially enhancing composed image retrieval
performance.
Authors' comments: The architecture of our model has changed, so the methodology
and experiments have changed substantially. We have significantly revised the
original manuscript, so a withdrawal of our original manuscript is needed
Andres Ferraro, Jaehun Kim, Sergio Oramas, Andreas Ehmann, Fabien Gouyon
Music retrieval and recommendation applications often rely on content features encoded as embeddings, which provide vector representations of items in a music dataset. Numerous complementary embeddings can be derived from processing items originally represented in several modalities, e.g., audio signals, user interaction data, or editorial data. However, data of any given modality might not be available for all items in any music dataset. In this work, we propose a method based on contrastive learning to combine embeddings from multiple modalities and explore the impact of the presence or absence of embeddings from diverse modalities in an artist similarity task. Experiments on two datasets suggest that our contrastive method outperforms single-modality embeddings and baseline algorithms for combining modalities, both in terms of artist retrieval accuracy and coverage. Improvements with respect to other methods are particularly significant for less popular query artists. We demonstrate our method successfully combines complementary information from diverse modalities, and is more robust to missing modality data (i.e., it better handles the retrieval of artists with different modality embeddings than the query artist's).
Lukas Liehr
We study the determination of a holomorphic function from its absolute value.
Given a parameter $\theta \in \mathbb{R}$, we derive the following
characterization of uniqueness in terms of rigidity of a set $\Lambda \subseteq
\mathbb{R}$: if $\mathcal{F}$ is a vector space of entire functions containing
all exponentials $e^{\xi z}, \, \xi \in \mathbb{C} \setminus \{ 0 \}$, then
every $F \in \mathcal{F}$ is uniquely determined up to a unimodular phase
factor by $\{|F(z)| : z \in e^{i\theta}(\mathbb{R} + i\Lambda)\}$ if and only
if $\Lambda$ is not contained in an arithmetic progression $a\mathbb{Z}+b$.
Leveraging this insight, we establish a series of consequences for Gabor phase
retrieval and Pauli-type uniqueness problems. For instance, $\mathbb{Z} \times
\tilde{\mathbb{Z}}$ is a uniqueness set for the Gabor phase retrieval problem
in $L^2(\mathbb{R}_+)$, provided that $\tilde{\mathbb{Z}}$ is a suitable
perturbation of the integers.
Authors' comments: 14 pages
Yunqiu Shao, Haitao Li, Yueyue Wu, Yiqun Liu, Qingyao Ai, Jiaxin Mao, Yixiao Ma, Shaoping Ma
Legal case retrieval is a special Information Retrieval (IR) task focusing on
legal case documents. Depending on the downstream tasks of the retrieved case
documents, users' information needs in legal case retrieval could be
significantly different from those in Web search and traditional ad-hoc
retrieval tasks. While there are several studies that retrieve legal cases
based on text similarity, the underlying search intents of legal retrieval
users, as shown in this paper, are more complicated than that yet mostly
unexplored. To this end, we present a novel hierarchical intent taxonomy of
legal case retrieval. It consists of five intent types categorized by three
criteria, i.e., search for Particular Case(s), Characterization, Penalty,
Procedure, and Interest. The taxonomy was constructed transparently and
evaluated extensively through interviews, editorial user studies, and query log
analysis. Through a laboratory user study, we reveal significant differences in
user behavior and satisfaction under different search intents in legal case
retrieval. Furthermore, we apply the proposed taxonomy to various downstream
legal retrieval tasks, e.g., result ranking and satisfaction prediction, and
demonstrate its effectiveness. Our work provides important insights into the
understanding of user intents in legal case retrieval and potentially leads to
better retrieval techniques in the legal domain, such as intent-aware ranking
strategies and evaluation methodologies.
Authors' comments: 28 pages, work in progress
Andrea Bacciu, Florin Cuconasu, Federico Siciliano, Fabrizio Silvestri, Nicola Tonellotto, Giovanni Trappolini
The emergence of large language models (LLMs) has revolutionized machine learning and related fields, showcasing remarkable abilities in comprehending, generating, and manipulating human language. However, their conventional usage through API-based text prompt submissions imposes certain limitations in terms of context constraints and external source availability. To address these challenges, we propose a novel framework called Reinforced Retrieval Augmented Machine Learning (RRAML). RRAML integrates the reasoning capabilities of LLMs with supporting information retrieved by a purpose-built retriever from a vast user-provided database. By leveraging recent advancements in reinforcement learning, our method effectively addresses several critical challenges. Firstly, it circumvents the need for accessing LLM gradients. Secondly, our method alleviates the burden of retraining LLMs for specific tasks, as it is often impractical or impossible due to restricted access to the model and the computational intensity involved. Additionally we seamlessly link the retriever's task with the reasoner, mitigating hallucinations and reducing irrelevant, and potentially damaging retrieved documents. We believe that the research agenda outlined in this paper has the potential to profoundly impact the field of AI, democratizing access to and utilization of LLMs for a wide range of entities.
Dan Edidin, Arun Suresh
In this paper we consider the problem of recovering a signal $x \in
\mathbb{R}^N$ from its power spectrum assuming that the signal is sparse with
respect to a generic basis for $\mathbb{R}^N$. Our main result is that if the
sparsity level is at most $\sim\! N/2$ in this basis then the generic sparse
vector is uniquely determined up to sign from its power spectrum. We also prove
that if the sparsity level is $\sim\! N/4$ then every sparse vector is
determined up to sign from its power spectrum. Analogous results are also
obtained for the power spectrum of a vector in $\mathbb{C}^N$ which extend
earlier results of Wang and Xu (arXiv:1310.0873).
Authors' comments: 20 pages
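The phrase "up to sign" in this abstract reflects the fact that a real signal $x$ and its negation $-x$ always have the same power spectrum, so no method can tell them apart from that data alone. The sketch below checks this numerically with a plain discrete Fourier transform; the helper name and the example vector are hypothetical, and the code illustrates only the ambiguity, not the recovery results of the paper.

```python
import cmath
from typing import List, Sequence


def power_spectrum(x: Sequence[float]) -> List[float]:
    """Squared magnitudes of the discrete Fourier transform of x."""
    n = len(x)
    spectrum = []
    for k in range(n):
        coeff = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


x = [1.0, 0.0, -2.0, 0.0, 3.0]   # hypothetical sparse signal in R^5
negated = [-v for v in x]          # sign flip
shifted = x[1:] + x[:1]            # cyclic shift, another known ambiguity

same_as_negated = all(
    abs(a - b) < 1e-9 for a, b in zip(power_spectrum(x), power_spectrum(negated))
)
same_as_shifted = all(
    abs(a - b) < 1e-9 for a, b in zip(power_spectrum(x), power_spectrum(shifted))
)
print(same_as_negated, same_as_shifted)
```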
L. Siddharth, Jianxi Luo
Aiming to support Retrieval Augmented Generation (RAG) in the design process,
we present a method to identify explicit, engineering design facts - {head
entity :: relationship :: tail entity} from patented artefact descriptions.
Given a sentence with a pair of entities (based on noun phrases) marked in a
unique manner, our method extracts the relationship that is explicitly
communicated in the sentence. For this task, we create a dataset of 375,084
examples and fine-tune language models for relation identification (token
classification) and elicitation (sequence-to-sequence). The token
classification approach achieves up to 99.7% accuracy. Upon applying the
method to a domain of 4,870 fan system patents, we populate a knowledge base of
over 2.93 million facts. Using this knowledge base, we demonstrate how Large
Language Models (LLMs) are guided by explicit facts to synthesise knowledge and
generate technical and cohesive responses when sought out for knowledge
retrieval tasks in the design process.
Authors' comments: Resources: Dataset -
https://huggingface.co/datasets/siddharthl1293/engineering_design_facts
Training Infrastructure - https://zenodo.org/records/12012131 Trained model -
https://huggingface.co/siddharthl1293/albert-albert-large-v2 Application -
https://github.com/siddharthl93/engineering-design-knowledge
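The fact format {head entity :: relationship :: tail entity} used in this abstract maps naturally onto a small record type. The sketch below is one hypothetical way to store and parse such facts; the class name, the helper, and the example fact are illustrative assumptions and are not taken from the released dataset or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One design fact: head entity, relationship, tail entity."""
    head: str
    relationship: str
    tail: str

    def render(self) -> str:
        """Serialize back to the '{head :: relationship :: tail}' notation."""
        return f"{{{self.head} :: {self.relationship} :: {self.tail}}}"


def parse_fact(text: str) -> Fact:
    """Parse '{head :: relationship :: tail}' into a Fact (assumed notation)."""
    parts = [part.strip() for part in text.strip().strip("{}").split("::")]
    if len(parts) != 3:
        raise ValueError(f"expected three '::'-separated fields, got {len(parts)}")
    return Fact(*parts)


# Hypothetical example fact, not drawn from the dataset.
fact = parse_fact("{fan blade :: attached to :: hub}")
print(fact.head, "|", fact.relationship, "|", fact.tail)
```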
Brunello Tirozzi, Orchidea Maria Lecian
A phoneme-retrieval technique is proposed that relies on the particular way
the network is constructed. An initial set of neurons is given. The
number of these neurons is approximately equal to the number of typical
structures of the data. For example if the network is built for voice retrieval
then the number of neurons must be equal to the number of characteristic
phonemes of the alphabet of the language spoken by the social group to which
the particular person belongs. Usually this task is very complicated and the
network can depend critically on the samples used for the learning. If the
network is built for image retrieval then it works only if the data to be
retrieved belong to a particular set of images. If the network is built for
voice recognition it works only for some particular set of words. A typical
example is the words used for the flight of airplanes. For example a command
like "the airplane should make a turn of 120 degrees towards the east" can be
easily recognized by the network if a suitable learning procedure is used.
Authors' comments: 10 pages
Helia Hashemi, Yong Zhuang, Sachith Sri Ram Kothur, Srivas Prasad, Edgar Meij, W. Bruce Croft
In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution is different from the source domain. Existing methods in this area focus on unsupervised domain adaptation where they have access to the target document collection or supervised (often few-shot) domain adaptation where they additionally have access to (limited) labeled data in the target domain. There also exists research on improving zero-shot performance of retrieval models with no adaptation. This paper introduces a new category of domain adaptation in IR that is as-yet unexplored. Here, similar to the zero-shot setting, we assume the retrieval model does not have access to the target document collection. In contrast, it does have access to a brief textual description that explains the target domain. We define a taxonomy of domain attributes in retrieval tasks to understand different properties of a source domain that can be adapted to a target domain. We introduce a novel automatic data construction pipeline that produces a synthetic document collection, query set, and pseudo relevance labels, given a textual domain description. Extensive experiments on five diverse target domains show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.
Palina Salanevich
In many signal processing problems arising in practical applications, we wish to reconstruct an unknown signal from its phaseless measurements with respect to a frame. This inverse problem is known as the phase retrieval problem. For each particular application, the set of relevant measurement frames is determined by the problem at hand, which motivates the study of phase retrieval for structured, application-relevant frames. In this paper, we focus on one class of such frames that appear naturally in diffraction imaging, ptychography, and audio processing, namely, multi-window Gabor frames. We study the question of injectivity of the phase retrieval problem with these measurement frames in the finite-dimensional setup and propose an explicit construction of an infinite family of phase retrievable multi-window Gabor frames. We show that phase retrievability for the constructed frames can be achieved with a much smaller number of phaseless measurements compared to the previous results for this type of measurement frames. Additionally, we show that the number of phaseless measurements sufficient for reconstruction depends on the dimension of the signal space, and not on the ambient dimension of the problem.
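To make "phaseless measurements with respect to a multi-window Gabor frame" concrete, the sketch below builds all time-frequency shifts of a few windows in a finite-dimensional space and records the squared magnitudes of the inner products with a signal. It is a minimal sketch under stated assumptions: the windows and the signal are arbitrary hypothetical values, the full set of shifts is used rather than the reduced families constructed in the paper, and no reconstruction is attempted. It also shows why recovery can only ever be up to a global phase.

```python
import cmath
from typing import List, Sequence


def gabor_atom(window: Sequence[complex], shift: int, mod: int) -> List[complex]:
    """Cyclically shift the window by `shift`, then modulate at frequency `mod`."""
    n = len(window)
    return [
        cmath.exp(2j * cmath.pi * mod * t / n) * window[(t - shift) % n]
        for t in range(n)
    ]


def phaseless_measurements(
    x: Sequence[complex], windows: Sequence[Sequence[complex]]
) -> List[float]:
    """Squared magnitudes |<x, atom>|^2 over all shifts and modulations of each window."""
    n = len(x)
    out = []
    for window in windows:
        for shift in range(n):
            for mod in range(n):
                atom = gabor_atom(window, shift, mod)
                inner = sum(xt * at.conjugate() for xt, at in zip(x, atom))
                out.append(abs(inner) ** 2)
    return out


# Hypothetical signal and windows in dimension 4.
x = [1 + 0j, 2j, -1 + 1j, 0.5 + 0j]
windows = [[1.0, 0.5, 0.0, 0.0], [1.0, -0.5, 0.0, 0.0]]
rotated = [cmath.exp(0.7j) * v for v in x]  # same signal up to a global phase

m_x = phaseless_measurements(x, windows)
m_rot = phaseless_measurements(rotated, windows)
print(len(m_x), all(abs(a - b) < 1e-9 for a, b in zip(m_x, m_rot)))
```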