Yile Wang, Peng Li, Maosong Sun, Yang Liu
Large language models (LLMs) have shown superior performance without task-specific fine-tuning. Despite the success, the knowledge stored in the parameters of LLMs could still be incomplete and difficult to update due to the computational costs. As complementary, retrieval-based methods can offer non-parametric world knowledge and improve the performance on tasks such as question answering. However, we find that the retrieved knowledge does not always help and even has a negative impact on original responses occasionally. To better make use of both internal knowledge and external world knowledge, we investigate eliciting the model's ability to recognize what they know and do not know (which is also called self-knowledge) and propose Self-Knowledge guided Retrieval augmentation (SKR), a simple yet effective method which can let LLMs refer to the questions they have previously encountered and adaptively call for external resources when dealing with new questions. We evaluate SKR on multiple datasets and demonstrate that it outperforms chain-of-thought based and fully retrieval-based methods by using either InstructGPT or ChatGPT.
Boyu Zhang, Hongyang Yang, Tianyu Zhou, Ali Babar, Xiao-Yang Liu
Financial sentiment analysis is critical for valuation and investment
decision-making. Traditional NLP models, however, are limited by their
parameter size and the scope of their training datasets, which hampers their
generalization capabilities and effectiveness in this field. Recently, Large
Language Models (LLMs) pre-trained on extensive corpora have demonstrated
superior performance across various NLP tasks due to their commendable
zero-shot abilities. Yet, directly applying LLMs to financial sentiment
analysis presents challenges: The discrepancy between the pre-training
objective of LLMs and predicting the sentiment label can compromise their
predictive performance. Furthermore, the succinct nature of financial news,
often devoid of sufficient context, can significantly diminish the reliability
of LLMs' sentiment analysis. To address these challenges, we introduce a
retrieval-augmented LLMs framework for financial sentiment analysis. This
framework includes an instruction-tuned LLMs module, which ensures LLMs behave
as predictors of sentiment labels, and a retrieval-augmentation module which
retrieves additional context from reliable external sources. Benchmarked
against traditional models and LLMs like ChatGPT and LLaMA, our approach
achieves 15\% to 48\% performance gain in accuracy and F1 score.
Authors' comments: ACM International Conference on AI in Finance (ICAIF) 2023
Fangyuan Xu, Weijia Shi, Eunsol Choi
Retrieving documents and prepending them in-context at inference time improves performance of language model (LMs) on a wide range of tasks. However, these documents, often spanning hundreds of words, make inference substantially more expensive. We propose compressing the retrieved documents into textual summaries prior to in-context integration. This not only reduces the computational costs but also relieves the burden of LMs to identify relevant information in long retrieved documents. We present two compressors -- an extractive compressor which selects useful sentences from retrieved documents and an abstractive compressor which generates summaries by synthesizing information from multiple documents. Both compressors are trained to improve LMs' performance on end tasks when the generated summaries are prepended to the LMs' input, while keeping the summary concise.If the retrieved documents are irrelevant to the input or offer no additional information to LM, our compressor can return an empty string, implementing selective augmentation.We evaluate our approach on language modeling task and open domain question answering task. We achieve a compression rate of as low as 6% with minimal loss in performance for both tasks, significantly outperforming the off-the-shelf summarization models. We show that our compressors trained for one LM can transfer to other LMs on the language modeling task and provide summaries largely faithful to the retrieved documents.
Chin-Chia Michael Yeh, Huiyuan Chen, Xin Dai, Yan Zheng, Junpeng Wang, Vivian Lai, Yujie Fan, Audrey Der et al.
A Content-based Time Series Retrieval (CTSR) system is an information retrieval system for users to interact with time series emerged from multiple domains, such as finance, healthcare, and manufacturing. For example, users seeking to learn more about the source of a time series can submit the time series as a query to the CTSR system and retrieve a list of relevant time series with associated metadata. By analyzing the retrieved metadata, users can gather more information about the source of the time series. Because the CTSR system is required to work with time series data from diverse domains, it needs a high-capacity model to effectively measure the similarity between different time series. On top of that, the model within the CTSR system has to compute the similarity scores in an efficient manner as the users interact with the system in real-time. In this paper, we propose an effective and efficient CTSR model that outperforms alternative models, while still providing reasonable inference runtimes. To demonstrate the capability of the proposed method in solving business problems, we compare it against alternative models using our in-house transaction data. Our findings reveal that the proposed model is the most suitable solution compared to others for our transaction data problem.
Brian Stern, Hanzi Huang, Haoshuo Chen, Kwangwoong Kim, Mohamad Hossein Idjadi
We demonstrate a silicon photonic dual-polarization phase retrieval receiver.
The receiver recovers phase from intensity-only measurements without a local
oscillator or transmitted carrier. We design silicon waveguides providing long
delays and microring resonators with large dispersion to enable
symbol-to-symbol interference and dispersive projection in the phase retrieval
algorithm. We retrieve the full field of a polarization-division multiplexed
30-GBd QPSK and 20-GBd 8QAM signals over 80 km of SSMF.
Authors' comments: 11 pages, 7 figures
Ori Yoran, Tomer Wolfson, Ori Ram, Jonathan Berant
Retrieval-augmented language models (RALMs) hold promise to produce language understanding systems that are are factual, efficient, and up-to-date. An important desideratum of RALMs, is that retrieved information helps model performance when it is relevant, and does not harm performance when it is not. This is particularly important in multi-hop reasoning scenarios, where misuse of irrelevant evidence can lead to cascading errors. However, recent work has shown that retrieval augmentation can sometimes have a negative effect on performance. In this work, we present a thorough analysis on five open-domain question answering benchmarks, characterizing cases when retrieval reduces accuracy. We then propose two methods to mitigate this issue. First, a simple baseline that filters out retrieved passages that do not entail question-answer pairs according to a natural language inference (NLI) model. This is effective in preventing performance reduction, but at a cost of also discarding relevant passages. Thus, we propose a method for automatically generating data to fine-tune the language model to properly leverage retrieved passages, using a mix of relevant and irrelevant contexts at training time. We empirically show that even 1,000 examples suffice to train the model to be robust to irrelevant contexts while maintaining high performance on examples with relevant ones.
Shu Zhao, Huijuan Xu
Composed image retrieval which combines a reference image and a text modifier to identify the desired target image is a challenging task, and requires the model to comprehend both vision and language modalities and their interactions. Existing approaches focus on holistic multi-modal interaction modeling, and ignore the composed and complimentary property between the reference image and text modifier. In order to better utilize the complementarity of multi-modal inputs for effective information fusion and retrieval, we move the multi-modal understanding to fine-granularity at concept-level, and learn the multi-modal concept alignment to identify the visual location in reference or target images corresponding to text modifier. Toward the end, we propose a NEUral COncept REasoning (NEUCORE) model which incorporates multi-modal concept alignment and progressive multimodal fusion over aligned concepts. Specifically, considering that text modifier may refer to semantic concepts not existing in the reference image and requiring to be added into the target image, we learn the multi-modal concept alignment between the text modifier and the concatenation of reference and target images, under multiple-instance learning framework with image and sentence level weak supervision. Furthermore, based on aligned concepts, to form discriminative fusion features of the input modalities for accurate target image retrieval, we propose a progressive fusion strategy with unified execution architecture instantiated by the attended language semantic concepts. Our proposed approach is evaluated on three datasets and achieves state-of-the-art results.
Yun-Ning Hung, Ju-Chiang Wang, Minz Won, Duc Le
In the era of data-driven Music Information Retrieval (MIR), the scarcity of labeled data has been one of the major concerns to the success of an MIR task. In this work, we leverage the semi-supervised teacher-student training approach to improve MIR tasks. For training, we scale up the unlabeled music data to 240k hours, which is much larger than any public MIR datasets. We iteratively create and refine the pseudo-labels in the noisy teacher-student training process. Knowledge expansion is also explored to iteratively scale up the model sizes from as small as less than 3M to almost 100M parameters. We study the performance correlation between data size and model size in the experiments. By scaling up both model size and training data, our models achieve state-of-the-art results on several MIR tasks compared to models that are either trained in a supervised manner or based on a self-supervised pretrained model. To our knowledge, this is the first attempt to study the effects of scaling up both model and training data for a variety of MIR tasks.
Qingqing Cao, Sewon Min, Yizhong Wang, Hannaneh Hajishirzi
Retrieval augmentation addresses many critical problems in large language models such as hallucination, staleness, and privacy leaks. However, running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text. We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages, significantly reducing computation during inference. Despite the potential loss of accuracy, our new calibration techniques and training objectives restore performance. Combined with offline and runtime compression, this only requires 127GB of disk space for encoding 3 billion tokens in Wikipedia. Our experiments show that on five knowledge-intensive NLP tasks, BTR accelerates state-of-the-art inference by up to 4x and reduces storage by over 100x while maintaining over 95% task performance.
Yiyao Yu, Junjie Wang, Yuxiang Zhang, Lin Zhang, Yujiu Yang, Tetsuya Sakai
Artificial intelligence (AI) technologies should adhere to human norms to better serve our society and avoid disseminating harmful or misleading information, particularly in Conversational Information Retrieval (CIR). Previous work, including approaches and datasets, has not always been successful or sufficiently robust in taking human norms into consideration. To this end, we introduce a workflow that integrates ethical alignment, with an initial ethical judgment stage for efficient data screening. To address the need for ethical judgment in CIR, we present the QA-ETHICS dataset, adapted from the ETHICS benchmark, which serves as an evaluation tool by unifying scenarios and label meanings. However, each scenario only considers one ethical concept. Therefore, we introduce the MP-ETHICS dataset to evaluate a scenario under multiple ethical concepts, such as justice and Deontology. In addition, we suggest a new approach that achieves top performance in both binary and multi-label ethical judgment tasks. Our research provides a practical method for introducing ethical alignment into the CIR workflow. The data and code are available at https://github.com/wanng-ide/ealm .
Hao Li, Jingkuan Song, Lianli Gao, Xiaosu Zhu, Heng Tao Shen
Cross-modal Retrieval methods build similarity relations between vision and
language modalities by jointly learning a common representation space. However,
the predictions are often unreliable due to the Aleatoric uncertainty, which is
induced by low-quality data, e.g., corrupt images, fast-paced videos, and
non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric
Uncertainty Quantification (PAU) framework to provide trustworthy predictions
by quantifying the uncertainty arisen from the inherent data ambiguity.
Concretely, we first construct a set of various learnable prototypes for each
modality to represent the entire semantics subspace. Then Dempster-Shafer
Theory and Subjective Logic Theory are utilized to build an evidential
theoretical framework by associating evidence with Dirichlet Distribution
parameters. The PAU model induces accurate uncertainty and reliable predictions
for cross-modal retrieval. Extensive experiments are performed on four major
benchmark datasets of MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the
effectiveness of our method. The code is accessible at
https://github.com/leolee99/PAU.
Authors' comments: Accepted to NeurIPS 2023
Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R. Fung, Hao Peng, Heng Ji
Large language models (LLMs) are often augmented with tools to solve complex
tasks. By generating code snippets and executing them through task-specific
Application Programming Interfaces (APIs), they can offload certain functions
to dedicated external modules, such as image encoding and performing
calculations. However, most existing approaches to augment LLMs with tools are
constrained by general-purpose APIs and lack the flexibility for tailoring them
to specific tasks. In this work, we present CRAFT, a general tool creation and
retrieval framework for LLMs. It creates toolsets specifically curated for the
tasks and equips LLMs with a component that retrieves tools from these sets to
enhance their capability to solve complex tasks. For each task, we collect
specific code solutions by prompting GPT-4 to solve the training examples.
Following a validation step ensuring the correctness, these solutions are
abstracted into code snippets to enhance reusability, and deduplicated for
higher quality. At inference time, the language model retrieves snippets from
the toolsets and then executes them or generates the output conditioning on the
retrieved snippets. Our method is designed to be flexible and offers a
plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and
modalities, without any finetuning. Experiments on vision-language, tabular
processing, and mathematical reasoning tasks show that our approach achieves
substantial improvements compared to strong baselines. In addition, our
in-depth analysis reveals that: (1) consistent performance improvement can be
achieved by scaling up the number of tools and the capability of the backbone
models; (2) each component of our approach contributes to the performance
gains; (3) the created tools are well-structured and reliable with low
complexity and atomicity. The code is available at
https://github.com/lifan-yuan/CRAFT.
Authors' comments: Accepted to ICLR 2024. Code is available at
https://github.com/lifan-yuan/CRAFT
Tobias Weitz, Christian Heide, Peter Hommelhoff
When Bloch electrons in a solid are exposed to a strong optical field, they are coherently driven in their respective bands where they acquire a quantum phase as the imprint of the band shape. If an electron approaches an avoided crossing formed by two bands, it may be split by undergoing a Landau-Zener transition. We here employ subsequent Landau-Zener transitions to realize strong-field Bloch electron interferometry (SFBEI), allowing us to reveal band structure information. In particular, we measure the Fermi velocity (band slope) of graphene in the vicinity of the K points as (1.07$\pm$0.04) nm fs$^{-1}$. We expect SFBEI for band structure retrieval to apply to a wide range of material systems and experimental conditions, making it suitable for studying transient changes in band structure with femtosecond temporal resolution at ambient conditions.
Pengxiang Wu, Siman Wang, Kevin Dela Rosa, Derek Hao Hu
Image retrieval is a fundamental task in computer vision. Despite recent
advances in this field, many techniques have been evaluated on a limited number
of domains, with a small number of instance categories. Notably, most existing
works only consider domains like 3D landmarks, making it difficult to
generalize the conclusions made by these works to other domains, e.g., logo and
other 2D flat objects. To bridge this gap, we introduce a new dataset for
benchmarking visual search methods on flat images with diverse patterns. Our
flat object retrieval benchmark (FORB) supplements the commonly adopted 3D
object domain, and more importantly, it serves as a testbed for assessing the
image embedding quality on out-of-distribution domains. In this benchmark we
investigate the retrieval accuracy of representative methods in terms of
candidate ranks, as well as matching score margin, a viewpoint which is largely
ignored by many works. Our experiments not only highlight the challenges and
rich heterogeneity of FORB, but also reveal the hidden properties of different
retrieval strategies. The proposed benchmark is a growing project and we expect
to expand in both quantity and variety of objects. The dataset and supporting
codes are available at https://github.com/pxiangwu/FORB/.
Authors' comments: NeurIPS 2023 Datasets and Benchmarks Track
Tom Bamford, Andrea Coletta, Elizabeth Fons, Sriram Gopalakrishnan, Svitlana Vyetrenko, Tucker Balch, Manuela Veloso
Financial firms commonly process and store billions of time-series data,
generated continuously and at a high frequency. To support efficient data
storage and retrieval, specialized time-series databases and systems have
emerged. These databases support indexing and querying of time-series by a
constrained Structured Query Language(SQL)-like format to enable queries like
"Stocks with monthly price returns greater than 5%", and expressed in rigid
formats. However, such queries do not capture the intrinsic complexity of high
dimensional time-series data, which can often be better described by images or
language (e.g., "A stock in low volatility regime"). Moreover, the required
storage, computational time, and retrieval complexity to search in the
time-series space are often non-trivial. In this paper, we propose and
demonstrate a framework to store multi-modal data for financial time-series in
a lower-dimensional latent space using deep encoders, such that the latent
space projections capture not only the time series trends but also other
desirable information or properties of the financial time-series data (such as
price volatility). Moreover, our approach allows user-friendly query
interfaces, enabling natural language text or sketches of time-series, for
which we have developed intuitive interfaces. We demonstrate the advantages of
our method in terms of computational efficiency and accuracy on real historical
data as well as synthetic data, and highlight the utility of latent-space
projections in the storage and retrieval of financial time-series data with
intuitive query modalities.
Authors' comments: Accepted to ICAIF 2023
Yucheng Shi, Shaochen Xu, Tianze Yang, Zhengliang Liu, Tianming Liu, Quanzheng Li, Xiang Li, Ninghao Liu
Large Language Models (LLMs), although powerful in general domains, often
perform poorly on domain-specific tasks such as medical question answering
(QA). In addition, LLMs tend to function as "black-boxes", making it
challenging to modify their behavior. To address the problem, our work employs
a transparent process of retrieval augmented generation (RAG), aiming to
improve LLM responses without the need for fine-tuning or retraining.
Specifically, we propose a comprehensive retrieval strategy to extract medical
facts from an external knowledge base, and then inject them into the LLM's
query prompt. Focusing on medical QA, we evaluate the impact of different
retrieval models and the number of facts on LLM performance using the
MedQA-SMILE dataset. Notably, our retrieval-augmented Vicuna-7B model exhibited
an accuracy improvement from 44.46% to 48.54%. This work underscores the
potential of RAG to enhance LLM performance, offering a practical approach to
mitigate the challenges posed by black-box LLMs.
Authors' comments: Accepted by AMIA 2024 Annual Symposium
Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata
Retrieving adverbs that describe an action in a video poses a crucial step
towards fine-grained video understanding. We propose a framework for
video-to-adverb retrieval (and vice versa) that aligns video embeddings with
their matching compositional adverb-action text embedding in a joint embedding
space. The compositional adverb-action text embedding is learned using a
residual gating mechanism, along with a novel training objective consisting of
triplet losses and a regression target. Our method achieves state-of-the-art
performance on five recent benchmarks for video-adverb retrieval. Furthermore,
we introduce dataset splits to benchmark video-adverb retrieval for unseen
adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet
Adverbs datasets. Our proposed framework outperforms all prior works for the
generalisation task of retrieving adverbs from videos for unseen adverb-action
compositions. Code and dataset splits are available at
https://hummelth.github.io/ReGaDa/.
Authors' comments: BMVC 2023 (Oral)
Hila Levi, Guy Heller, Dan Levi, Ethan Fetaya
The task of open-vocabulary object-centric image retrieval involves the
retrieval of images containing a specified object of interest, delineated by an
open-set text query. As working on large image datasets becomes standard,
solving this task efficiently has gained significant practical importance.
Applications include targeted performance analysis of retrieved images using
ad-hoc queries and hard example mining during training. Recent advancements in
contrastive-based open vocabulary systems have yielded remarkable
breakthroughs, facilitating large-scale open vocabulary image retrieval.
However, these approaches use a single global embedding per image, thereby
constraining the system's ability to retrieve images containing relatively
small object instances. Alternatively, incorporating local embeddings from
detection pipelines faces scalability challenges, making it unsuitable for
retrieval from large databases.
In this work, we present a simple yet effective approach to object-centric
open-vocabulary image retrieval. Our approach aggregates dense embeddings
extracted from CLIP into a compact representation, essentially combining the
scalability of image retrieval pipelines with the object identification
capabilities of dense detection methods. We show the effectiveness of our
scheme to the task by achieving significantly better results than global
feature approaches on three datasets, increasing accuracy by up to 15 mAP
points. We further integrate our scheme into a large scale retrieval framework
and demonstrate our method's advantages in terms of scalability and
interpretability.
Authors' comments: BMVC 2023
Luis Carvalho, Gerhard Widmer
A range of applications of multi-modal music information retrieval is centred
around the problem of connecting large collections of sheet music (images) to
corresponding audio recordings, that is, identifying pairs of audio and score
excerpts that refer to the same musical content. One of the typical and most
recent approaches to this task employs cross-modal deep learning architectures
to learn joint embedding spaces that link the two distinct modalities - audio
and sheet music images. While there has been steady improvement on this front
over the past years, a number of open problems still prevent large-scale
employment of this methodology. In this article we attempt to provide an
insightful examination of the current developments on audio-sheet music
retrieval via deep learning methods. We first identify a set of main challenges
on the road towards robust and large-scale cross-modal music retrieval in real
scenarios. We then highlight the steps we have taken so far to address some of
these challenges, documenting step-by-step improvement along several
dimensions. We conclude by analysing the remaining challenges and present ideas
for solving these, in order to pave the way to a unified and robust methodology
for cross-modal music retrieval.
Authors' comments: Proceedings of the IEEE 6th International Conference on Multimedia
Information Processing and Retrieval (MIPR)
Luis Carvalho, Gerhard Widmer
Many applications of cross-modal music retrieval are related to connecting
sheet music images to audio recordings. A typical and recent approach to this
is to learn, via deep neural networks, a joint embedding space that correlates
short fixed-size snippets of audio and sheet music by means of an appropriate
similarity structure. However, two challenges that arise out of this strategy
are the requirement of strongly aligned data to train the networks, and the
inherent discrepancies of musical content between audio and sheet music
snippets caused by local and global tempo differences. In this paper, we
address these two shortcomings by designing a cross-modal recurrent network
that learns joint embeddings that can summarize longer passages of
corresponding audio and sheet music. The benefits of our method are that it
only requires weakly aligned audio-sheet music pairs, as well as that the
recurrent network handles the non-linearities caused by tempo variations
between audio and sheet music. We conduct a number of experiments on synthetic
and real piano data and scores, showing that our proposed recurrent method
leads to more accurate retrieval in all possible configurations.
Authors' comments: In Proceedings of the 24th Conference of the International Society
for Music Information Retrieval (ISMIR 2023), Milan, Italy