Jiaming Zhou, Shiwan Zhao, Yaqi Liu, Wenjia Zeng, Yong Chen, Yong Qin
The success of retrieval-augmented language models in various natural
language processing (NLP) tasks has been constrained in automatic speech
recognition (ASR) applications due to challenges in constructing fine-grained
audio-text datastores. This paper presents kNN-CTC, a novel approach that
overcomes these challenges by leveraging Connectionist Temporal Classification
(CTC) pseudo labels to establish frame-level audio-text key-value pairs,
circumventing the need for precise ground truth alignments. We further
introduce a skip-blank strategy, which strategically ignores CTC blank frames,
to reduce datastore size. By incorporating a k-nearest neighbors retrieval
mechanism into pre-trained CTC ASR systems and leveraging a fine-grained, pruned
datastore, kNN-CTC consistently achieves substantial improvements in performance
under various experimental settings. Our code is available at
https://github.com/NKU-HLT/KNN-CTC.
Authors' comments: Accepted by ICASSP 2024
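The frame-level datastore and skip-blank strategy described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_datastore` and `knn_distribution`, the blank index 0, the squared Euclidean distance, and the exponential weighting of neighbors are all hypothetical choices made for the example.

```python
import math

BLANK_ID = 0  # assumed index of the CTC blank token


def build_datastore(features, ctc_logits, skip_blank=True):
    """Pair each frame feature (key) with its CTC pseudo label (value)."""
    keys, values = [], []
    for feat, logits in zip(features, ctc_logits):
        label = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        if skip_blank and label == BLANK_ID:
            continue  # blank frames are not stored, shrinking the datastore
        keys.append(feat)
        values.append(label)
    return keys, values


def knn_distribution(query, keys, values, vocab_size, k=2, temperature=1.0):
    """Turn the k nearest stored frames into a distribution over labels."""
    dists = [sum((q - x) ** 2 for q, x in zip(query, key)) for key in keys]
    nearest = sorted(range(len(keys)), key=lambda i: dists[i])[:k]
    probs = [0.0] * vocab_size
    for i in nearest:
        probs[values[i]] += math.exp(-dists[i] / temperature)
    total = sum(probs)
    return [p / total for p in probs]
```

No ground-truth alignment is needed here because the values come from the model's own frame-level predictions.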
Norman Di Palo, Edward Johns
Imitation learning with visual observations is notoriously inefficient when
addressed with end-to-end behavioural cloning methods. In this paper, we
explore an alternative paradigm which decomposes reasoning into three phases.
First, a retrieval phase, which informs the robot what it can do with an
object. Second, an alignment phase, which informs the robot where to interact
with the object. And third, a replay phase, which informs the robot how to
interact with the object. Through a series of real-world experiments on
everyday tasks, such as grasping, pouring, and inserting objects, we show that
this decomposition brings unprecedented learning efficiency, and effective
inter- and intra-class generalisation. Videos are available at
https://www.robot-learning.uk/retrieval-alignment-replay.
Authors' comments: Published in IEEE Robotics and Automation Letters (RA-L). (Accepted
December 2023)
Chun-Mei Feng, Yang Bai, Tao Luo, Zhen Li, Salman Khan, Wangmeng Zuo, Xinxing Xu, Rick Siow Mong Goh et al.
Although progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failed retrieval results are not consistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods. Given the top-C retrieved images by a CIR method, VQA4CIR aims to decrease the adverse effect of the failed retrieval results being inconsistent with the relative caption. To find the retrieved images inconsistent with the relative caption, we resort to the "QA generation to VQA" self-verification pipeline. For QA generation, we suggest fine-tuning an LLM (e.g., LLaMA) to generate several pairs of questions and answers from each relative caption. We then fine-tune an LVLM (e.g., LLaVA) to obtain the VQA model. By feeding the retrieved image and question to the VQA model, one can find the images inconsistent with the relative caption when the answer by VQA is inconsistent with the answer in the QA pair. Consequently, the CIR performance can be boosted by modifying the ranks of inconsistently retrieved images. Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
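The post-processing step of this abstract, in which images whose VQA answers contradict the caption-derived QA pairs have their ranks modified, might look like the sketch below. The abstract does not say how the ranks are changed. Demoting inconsistent images behind consistent ones is an assumption, and `vqa_answer` is a hypothetical stand-in for the fine-tuned VQA model.

```python
def rerank_by_consistency(ranked_images, qa_pairs, vqa_answer):
    """Move images that fail the QA check behind those that pass it.

    ranked_images: image identifiers in the order returned by a CIR method
    qa_pairs:      (question, expected_answer) pairs derived from the caption
    vqa_answer:    hypothetical callable (image, question) -> answer string
    """
    consistent, inconsistent = [], []
    for image in ranked_images:
        ok = all(
            vqa_answer(image, question).strip().lower() == answer.strip().lower()
            for question, answer in qa_pairs
        )
        # Original relative order is preserved inside each group.
        (consistent if ok else inconsistent).append(image)
    return consistent + inconsistent
```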
Wenhao Ding, Yulong Cao, Ding Zhao, Chaowei Xiao, Marco Pavone
Simulation plays a crucial role in the development of autonomous vehicles
(AVs) due to the potential risks associated with real-world testing. Although
significant progress has been made in the visual aspects of simulators,
generating complex behavior among agents remains a formidable challenge. It is
not only imperative to ensure realism in the scenarios generated but also
essential to incorporate preferences and conditions to facilitate controllable
generation for AV training and evaluation. Traditional methods, mainly relying
on memorizing the distribution of training datasets, often fall short in
generating unseen scenarios. Inspired by the success of retrieval augmented
generation in large language models, we present RealGen, a novel
retrieval-based in-context learning framework for traffic scenario generation.
RealGen synthesizes new scenarios by combining behaviors from multiple
retrieved examples in a gradient-free way, which may originate from templates
or tagged scenarios. This in-context learning framework endows versatile
generative capabilities, including the ability to edit scenarios, compose
various behaviors, and produce critical scenarios. Evaluations show that
RealGen offers considerable flexibility and controllability, marking a new
direction in the field of controllable traffic scenario generation. Check our
project website for more information: https://realgen.github.io.
Authors' comments: Accepted by ECCV 2024, Oral
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun et al.
Large Language Models (LLMs) showcase impressive capabilities but encounter
challenges like hallucination, outdated knowledge, and non-transparent,
untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has
emerged as a promising solution by incorporating knowledge from external
databases. This enhances the accuracy and credibility of the generation,
particularly for knowledge-intensive tasks, and allows for continuous knowledge
updates and integration of domain-specific information. RAG synergistically
merges LLMs' intrinsic knowledge with the vast, dynamic repositories of
external databases. This comprehensive review paper offers a detailed
examination of the progression of RAG paradigms, encompassing the Naive RAG,
the Advanced RAG, and the Modular RAG. It meticulously scrutinizes the
tripartite foundation of RAG frameworks, which includes the retrieval, the
generation and the augmentation techniques. The paper highlights the
state-of-the-art technologies embedded in each of these critical components,
providing a profound understanding of the advancements in RAG systems.
Furthermore, this paper introduces an up-to-date evaluation framework and
benchmark. Finally, this article delineates the challenges currently faced
and points out prospective avenues for research and development.
Authors' comments: Ongoing Work
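As a rough illustration of the retrieve-then-generate flow that the survey calls Naive RAG, the sketch below retrieves passages and prepends them to the prompt. It is a toy under stated assumptions: the word-overlap scorer, the prompt template, and the function names `retrieve` and `build_prompt` are hypothetical, and the generation step (an LLM call) is left out.

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (toy stand-in for a retriever)."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(doc.lower().split())), i)
        for i, doc in enumerate(documents)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best overlap first
    return [documents[i] for _, i in scored[:k]]


def build_prompt(query, passages):
    """Augment the question with retrieved passages before generation."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using the context below.\n{context}\nQuestion: {query}"
```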
Run-Ze Fan, Yixing Fan, Jiangui Chen, Jiafeng Guo, Ruqing Zhang, Xueqi Cheng
Automatic mainstream hashtag recommendation aims to accurately provide users
with concise and popular topical hashtags before publication. Generally,
mainstream hashtag recommendation faces two challenges: the difficulty of
comprehending newly posted tweets in response to new topics, and the accurate
identification of mainstream hashtags beyond semantic correctness. Previous
retrieval-based methods, which rely on a fixed predefined mainstream hashtag
list, excel in producing mainstream hashtags but fail to understand the
constant flow of up-to-date information. Conversely, generation-based methods
demonstrate a superior ability to comprehend newly posted tweets, but their
capacity to identify mainstream hashtags is limited without additional
features. Inspired by the recent success of the retrieval-augmented technique,
in this work, we attempt to adopt this framework to combine the advantages of
both approaches. Meanwhile, with the help of the generator component, we can
rethink how to further improve the quality of the retriever component at a low
cost. Therefore, we propose RetrIeval-augmented Generative Mainstream HashTag
Recommender (RIGHT), which consists of three components: 1) a retriever seeks
relevant hashtags from the entire tweet-hashtags set; 2) a selector enhances
mainstream identification by introducing global signals; and 3) a generator
incorporates input tweets and selected hashtags to directly generate the
desired hashtags. The experimental results show that our method achieves
significant improvements over state-of-the-art baselines. Moreover, RIGHT can
be easily integrated into large language models, improving the performance of
ChatGPT by more than 10%.
Authors' comments: Accepted by ECIR 2024 full paper
Zhenxi Lin, Ziheng Zhang, Xian Wu, Yefeng Zheng
Biomedical entity linking (BioEL) has achieved remarkable progress with the
help of pre-trained language models. However, existing BioEL methods usually
struggle to handle rare and difficult entities due to long-tailed distribution.
To address this limitation, we introduce a new scheme $k$NN-BioEL, which
provides a BioEL model with the ability to reference similar instances from the
entire training corpus as clues for prediction, thus improving the
generalization capabilities. Moreover, we design a contrastive learning
objective with dynamic hard negative sampling (DHNS) that improves the quality
of the retrieved neighbors during inference. Extensive experimental results
show that $k$NN-BioEL outperforms state-of-the-art baselines on several
datasets.
Authors' comments: Accepted by ICASSP 2024
Joel Yeo, Benedikt J. Daurer, Dari Kimanius, Deepan Balakrishnan, Tristan Bepler, Yong Zi Tan, N. Duane Loh
Ewald sphere curvature correction, which extends beyond the projection
approximation, stretches the shallow depth of field in cryo-EM reconstructions
of thick particles. Here we show that even for particles previously assumed to be
thin, reconstruction artifacts, which we refer to as ghosts, can appear. By
retrieving the lost phases of the electron exitwaves and accounting for the
first Born approximation scattering within the particle, we show that these
ghosts can be effectively eliminated. Our simulations demonstrate how such
ghostbusting can improve reconstructions as compared to existing
state-of-the-art software. Like ptychographic cryo-EM, our Ghostbuster
algorithm uses phase retrieval to improve reconstructions, but unlike the
former, we do not need to modify the existing data acquisition pipelines.
Authors' comments: 20 pages, 11 figures. Submitted to IUCrJ
Joan Figuerola Hurtado
The paper presents a methodology for uncovering knowledge gaps on the internet using the Retrieval Augmented Generation (RAG) model. By simulating user search behaviour, the RAG system identifies and addresses gaps in information retrieval systems. The study demonstrates the effectiveness of the RAG system in generating relevant suggestions with a consistent accuracy of 93%. The methodology can be applied in various fields such as scientific discovery, educational enhancement, research development, market analysis, search engine optimisation, and content development. The results highlight the value of identifying and understanding knowledge gaps to guide future endeavours.
Jenny Hamer, Eleni Triantafillou, Bart van Merrienboer, Stefan Kahl, Holger Klinck, Tom Denton, Vincent Dumoulin
The ability of a machine learning model to cope with differences in training and deployment conditions--e.g. in the presence of distribution shift or the generalization to new classes altogether--is crucial for real-world use cases. However, most empirical work in this area has focused on the image domain with artificial benchmarks constructed to measure individual aspects of generalization. We present BIRB, a complex benchmark centered on the retrieval of bird vocalizations from passively-recorded datasets given focal recordings from a large citizen science corpus available for training. We propose a baseline system for this collection of tasks using representation learning and a nearest-centroid search. Our thorough empirical evaluation and analysis surface open research directions, suggesting that BIRB fills the need for a more realistic and complex benchmark to drive progress on robustness to distribution shifts and generalization of ML models.
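A nearest-centroid search of the kind named in this abstract can be sketched as follows. This is an illustrative assumption rather than the BIRB baseline itself: the embeddings are taken as given, cosine similarity is an assumed scoring function, and the function names are hypothetical.

```python
import math


def class_centroids(embeddings, labels):
    """Average the embeddings of each class into a single centroid."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_centroid(candidates, centroid):
    """Return candidate indices sorted from most to least similar to the centroid."""
    return sorted(range(len(candidates)), key=lambda i: -cosine(candidates[i], centroid))
```

In this setup the centroids would come from the labelled focal recordings, and the candidates would be embeddings of the passively recorded audio being searched.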
Jian Zhu, Yu Cui, Zhangmin Huang, Xingyu Li, Lei Liu, Lingfang Zeng, Li-Rong Dai
The multi-view hash method converts heterogeneous data from multiple views
into binary hash codes, which is one of the critical technologies in multimedia
retrieval. However, the current methods mainly explore the complementarity
among multiple views while lacking confidence learning and fusion. Moreover, in
practical application scenarios, the single-view data contain redundant noise.
To conduct the confidence learning and eliminate unnecessary noise, we propose
a novel Adaptive Confidence Multi-View Hashing (ACMVH) method. First, a
confidence network is developed to extract useful information from various
single-view features and remove noise information. Furthermore, an adaptive
confidence multi-view network is employed to measure the confidence of each
view and then fuse multi-view features through a weighted summation. Lastly, a
dilation network is designed to further enhance the feature representation of
the fused features. To the best of our knowledge, we pioneer the application of
confidence learning to the field of multimedia retrieval. Extensive
experiments on two public datasets show that the proposed ACMVH performs better
than state-of-the-art methods (maximum increase of 3.24%). The source code is
available at https://github.com/HackerHyper/ACMVH.
Authors' comments: Accepted by International Conference on Acoustics, Speech and Signal
Processing 2024
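The confidence-weighted fusion step described in this abstract (measure the confidence of each view, then fuse by weighted summation) can be sketched as below. This is a simplified illustration, not the ACMVH network: the softmax normalization of the confidence scores and the sign-based binarization are assumptions, and the learned confidence and dilation networks are omitted.

```python
import math


def fuse_views(view_features, confidence_logits):
    """Fuse per-view feature vectors with softmax-normalized confidence weights."""
    peak = max(confidence_logits)  # subtract the max for numerical stability
    exps = [math.exp(c - peak) for c in confidence_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(view_features[0])
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, view_features))
        for i in range(dim)
    ]
    return fused, weights


def binarize(vector):
    """Map a real-valued vector to a +1/-1 hash code by sign (assumed scheme)."""
    return [1 if x >= 0 else -1 for x in vector]
```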
Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Qian Li, Chao Shen
Adversarial training has achieved substantial performance in defending image retrieval against adversarial examples. However, existing studies in deep metric learning (DML) still suffer from two major limitations: weak adversary and model collapse. In this paper, we address these two limitations by proposing collapse-aware triplet decoupling (CA-TRIDE). Specifically, TRIDE yields a strong adversary by spatially decoupling the perturbation targets into the anchor and the other candidates. Furthermore, CA prevents the consequential model collapse, based on a novel metric, collapseness, which is incorporated into the optimization of perturbation. We also identify two drawbacks of the existing robustness metric in image retrieval and propose a new metric for a more reasonable robustness evaluation. Extensive experiments on three datasets demonstrate that CA-TRIDE outperforms existing defense methods in both conventional and new metrics.
Gao Huang, Song Li, Hang Xu
We investigate the phase retrieval problem perturbed by dense bounded noise
and sparse outliers that can change an adversarially chosen $s$-fraction of the
measurement vector. The adversarial sparse outliers may exhibit dependence on
both the observation and the measurement. We demonstrate that the nonlinear
least absolute deviation based on amplitude measurement can tolerate
adversarial outliers at a fraction of $s^{*,1}\approx0.2043$, while the
intensity-based model can tolerate a fraction of $s^{*,2}\approx0.1185$.
Furthermore, we construct adaptive counterexamples to show that the thresholds
are theoretically sharp, thereby demonstrating the presence of a phase transition
in the adversarial phase retrieval problem when the corruption fraction exceeds
the sharp thresholds. This implies that the amplitude-based model exhibits
superior adversarial robustness in comparison with the intensity-based model.
Corresponding experimental results are presented to further illustrate our
theoretical findings. To the best of our knowledge, our results provide the
first theoretical examination of the distinction in robustness performance
between amplitude and intensity measurement. A crucial point of our analysis is
that we explore the exact distribution of some combination of two
non-independent Gaussian random variables and present the novel probability
density functions to derive the sharp thresholds.
Authors' comments: 32 pages
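For orientation, the two least absolute deviation (LAD) models compared in this abstract are commonly written as below. The notation is an assumption, since the abstract gives no formulas: $a_i$ are measurement vectors, $x$ is the unknown signal, $m$ is the number of measurements, and $b_i$ and $y_i$ are the (possibly corrupted) amplitude and intensity observations.

```latex
% Amplitude-based LAD (assumed form), with observations b_i \approx |\langle a_i, x_0 \rangle|
\min_{x} \; \frac{1}{m} \sum_{i=1}^{m} \Bigl|\, |\langle a_i, x \rangle| - b_i \,\Bigr|

% Intensity-based LAD (assumed form), with observations y_i \approx |\langle a_i, x_0 \rangle|^2
\min_{x} \; \frac{1}{m} \sum_{i=1}^{m} \Bigl|\, |\langle a_i, x \rangle|^{2} - y_i \,\Bigr|
```

Under this reading, the thresholds $s^{*,1}$ and $s^{*,2}$ are the largest fractions of adversarially corrupted entries that the first and second model, respectively, can tolerate.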
Christos Plachouras, Pablo Alonso-Jiménez, Dmitry Bogdanov
Music Information Retrieval (MIR) research is increasingly leveraging
representation learning to obtain more compact, powerful music audio
representations for various downstream MIR tasks. However, current
representation evaluation methods are fragmented due to discrepancies in audio
and label preprocessing, downstream model and metric implementations, data
availability, and computational resources, often leading to inconsistent and
limited results. In this work, we introduce mir_ref, an MIR Representation
Evaluation Framework focused on seamless, transparent, local-first experiment
orchestration to support representation development. It features
implementations of a variety of components such as MIR datasets, tasks,
embedding models, and tools for result analysis and visualization, while
facilitating the implementation of custom components. To demonstrate its
utility, we use it to conduct an extensive evaluation of several embedding
models across various tasks and datasets, including evaluating their robustness
to various audio perturbations and the ease of extracting relevant information
from them.
Authors' comments: Machine Learning for Audio Workshop, Neural Information Processing
Systems (NeurIPS) 2023, New Orleans, LA
Matteo Allaix
In the era of extensive data growth, robust and efficient mechanisms are
needed to store and manage vast amounts of digital information, such as Data
Storage Systems (DSSs). Concurrently, privacy concerns have arisen, leading to
the development of techniques like Private Information Retrieval (PIR) to
enable data access while preserving privacy. A PIR protocol allows users to
retrieve information from a database without revealing the specifics of their
query or the data they are accessing.
With the advent of quantum computing, researchers have explored the potential
of using quantum systems to enhance privacy in information retrieval. In a
Quantum Private Information Retrieval (QPIR) protocol, a user can retrieve
information from a database by downloading quantum systems from multiple
servers, while ensuring that the servers remain oblivious to the specific
information being accessed. This scenario offers a unique advantage by
leveraging the inherent properties of quantum systems to provide enhanced
privacy guarantees and improved communication rates compared to classical PIR
protocols.
In this thesis we consider the QPIR setting where the queries and the coded
storage systems are classical, while the responses from the servers are
quantum. This problem was treated by Song et al. for replicated storage and
different collusion patterns. This thesis aims to develop QPIR protocols for
coded storage by combining known classical PIR protocols with quantum
communication algorithms, achieving enhanced privacy and improved communication costs.
We consider different storage codes and robustness assumptions, and we prove
that the achieved communication cost is always lower than the classical
counterparts.
Authors' comments: This is the summary part of an article collection-based PhD thesis
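To make the PIR idea in this abstract concrete, here is a sketch of the textbook classical two-server scheme for replicated storage with non-colluding servers. It is not one of the thesis's quantum protocols and does not cover coded storage. The function names and the integer-record database are assumptions for illustration.

```python
import secrets


def make_queries(num_records, wanted):
    """Build two index sets that differ only in the wanted index."""
    q1 = {i for i in range(num_records) if secrets.randbits(1)}
    q2 = q1 ^ {wanted}  # symmetric difference toggles the wanted index
    return q1, q2


def server_answer(database, query):
    """Each server XORs together the records selected by its query."""
    answer = 0
    for i in query:
        answer ^= database[i]
    return answer


def recover(answer1, answer2):
    """All records except the wanted one cancel out under XOR."""
    return answer1 ^ answer2
```

Each server sees a uniformly random subset of indices, so neither one alone learns which record was requested. The price is that every answer touches about half of the database.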
Oded Ovadia, Menachem Brief, Moshik Mishaeli, Oren Elisha
Large language models (LLMs) encapsulate a vast amount of factual information within their pre-trained weights, as evidenced by their ability to answer diverse questions across different domains. However, this knowledge is inherently limited, relying heavily on the characteristics of the training data. Consequently, using external datasets to incorporate new information or refine the capabilities of LLMs on previously seen information poses a significant challenge. In this study, we compare two common approaches: unsupervised fine-tuning and retrieval-augmented generation (RAG). We evaluate both approaches on a variety of knowledge-intensive tasks across different topics. Our findings reveal that while unsupervised fine-tuning offers some improvement, RAG consistently outperforms it, both for existing knowledge encountered during training and entirely new knowledge. Moreover, we find that LLMs struggle to learn new factual information through unsupervised fine-tuning, and that exposing them to numerous variations of the same fact during training could alleviate this problem.
Jakub Lála, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, Andrew D. White
Large Language Models (LLMs) generalize well across language tasks, but suffer from hallucinations and uninterpretability, making it difficult to assess their accuracy without ground-truth. Retrieval-Augmented Generation (RAG) models have been proposed to reduce hallucinations and provide provenance for how an answer was generated. Applying such models to the scientific literature may enable large-scale, systematic processing of scientific knowledge. We present PaperQA, a RAG agent for answering questions over the scientific literature. PaperQA is an agent that performs information retrieval across full-text scientific articles, assesses the relevance of sources and passages, and uses RAG to provide answers. Viewing this agent as a question answering model, we find it exceeds the performance of existing LLMs and LLM agents on current science QA benchmarks. To push the field closer to how humans perform research on scientific literature, we also introduce LitQA, a more complex benchmark that requires retrieval and synthesis of information from full-text scientific papers across the literature. Finally, we demonstrate that PaperQA matches expert human researchers on LitQA.
Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, Yan Lu
The remarkable natural language understanding, reasoning, and generation
capabilities of large language models (LLMs) have made them attractive for
application to video understanding, utilizing video tokens as contextual input.
However, employing LLMs for long video understanding presents significant
challenges. The extensive number of video tokens leads to considerable
computational costs for LLMs while using aggregated tokens results in loss of
vision details. Moreover, the presence of abundant question-irrelevant tokens
introduces noise to the video reasoning process. To address these issues, we
introduce a simple yet effective learnable retrieval-based video-language model
(R-VLM) for efficient long video understanding. Specifically, given a question
(query) and a long video, our model identifies and selects the most relevant K
video chunks and uses their associated visual tokens to serve as context for
the LLM inference. This effectively reduces the number of video tokens,
eliminates noise interference, and enhances system performance. We achieve this
by incorporating a learnable lightweight MLP block to facilitate the efficient
retrieval of question-relevant chunks, through the end-to-end training of our
video-language model with a proposed soft matching loss. Our experimental
results on multiple zero-shot video question answering datasets validate the
effectiveness of our framework for comprehending long videos.
Authors' comments: 14 pages, 8 figures
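The chunk-selection step in this abstract (pick the K video chunks most relevant to the question and pass only their tokens to the LLM) can be sketched as follows. This is a simplified assumption, not the R-VLM model: chunk and question vectors are taken as given, dot-product similarity stands in for the learned matching, and returning the chunks in temporal order is a design choice made for the example.

```python
def select_top_k_chunks(question_vec, chunk_vecs, k):
    """Return indices of the k chunks most similar to the question, in temporal order."""
    scores = [sum(q * c for q, c in zip(question_vec, chunk)) for chunk in chunk_vecs]
    top = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # restore temporal order before building the LLM context
```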
Stephen Brade, Bryan Wang, Mauricio Sousa, Gregory Lee Newsome, Sageev Oore, Tovi Grossman
Synthesizers are powerful tools that allow musicians to create dynamic and original sounds. Existing commercial interfaces for synthesizers typically require musicians to interact with complex low-level parameters or to manage large libraries of premade sounds. To address these challenges, we implement SynthScribe -- a full-stack system that uses multimodal deep learning to let users express their intentions at a much higher level. We implement features which address a number of difficulties, namely 1) searching through existing sounds, 2) creating completely new sounds, and 3) making meaningful modifications to a given sound. This is achieved with three main features: a multimodal search engine for a large library of synthesizer sounds; a user-centered genetic algorithm by which completely new sounds can be created and selected given the user's preferences; and a sound editing support feature which highlights and gives examples for key control parameters with respect to a text- or audio-based query. The results of our user studies show SynthScribe is capable of reliably retrieving and modifying sounds while also affording the ability to create completely new sounds that expand a musician's creative horizon.
Claudio Spiess
Software reverse engineering is an essential task in software engineering and security, but it can be a challenging process, especially for adversarial artifacts. To address this challenge, we present STraceBERT, a novel approach that utilizes a Java dynamic analysis tool to record calls to core Java libraries, and pretrains a BERT-style model on the recorded application traces for effective method source code retrieval from a candidate set. Our experiments demonstrate the effectiveness of STraceBERT in retrieving the source code compared to existing approaches. Our proposed approach offers a promising solution to the problem of code retrieval in software reverse engineering and opens up new avenues for further research in this area.