Anni Yue, Stephen L. Smith
Robotic-based compact storage and retrieval systems provide high-density
storage in distribution center and warehouse applications. In the system, items
are stored in bins, and the bins are organized inside a three-dimensional grid.
Robots move on top of the grid to retrieve and deliver bins. To retrieve a bin,
a robot removes all bins above it one by one with its gripper, a process called
bin digging.
The closer the target bin is to the top of the grid, the less digging is
required to retrieve the bin. In this paper, we propose a policy to optimally
arrange the bins in the grid while processing bin requests so that the most
frequently accessed bins remain near the top of the grid. This improves the
performance of the system and makes it responsive to changes in bin demand. Our
solution approach identifies the optimal bin arrangement in the storage
facility, initiates a transition to this optimal set-up, and subsequently
ensures the ongoing maintenance of this arrangement for optimal performance. We
perform extensive simulations on a custom-built discrete event model of the
system. Our simulation results show that under the proposed policy more than
half of the bins requested are located on top of the grid, reducing bin digging
compared to existing policies. Compared to existing approaches, the proposed
policy reduces the retrieval time of the requested bins by over 30% and the
number of bin requests that exceed certain time thresholds by nearly 50%.
Authors' comments: 35 pages, 16 figures, submitted to Transportation Science (INFORMS)
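The relationship between bin position and digging effort described above can be sketched in a few lines. This is only an illustration, not the policy proposed in the paper: the function names, the round-robin layering heuristic, and the demand values are all hypothetical.

```python
from typing import Dict, List


def arrange_by_demand(demand: Dict[str, float], num_stacks: int) -> List[List[str]]:
    """Fill stacks layer by layer, most-requested bins first.

    Each stack is listed top-to-bottom, so index 0 is the bin on top.
    Illustrative heuristic only (an assumption, not the paper's policy).
    """
    ranked = sorted(demand, key=lambda bin_id: demand[bin_id], reverse=True)
    stacks: List[List[str]] = [[] for _ in range(num_stacks)]
    for position, bin_id in enumerate(ranked):
        stacks[position % num_stacks].append(bin_id)
    return stacks


def expected_digs(stacks: List[List[str]], demand: Dict[str, float]) -> float:
    """Expected number of bins removed per request.

    A bin at depth d (0 = top) needs d bins dug out before it can be lifted.
    """
    return sum(
        demand[bin_id] * depth
        for stack in stacks
        for depth, bin_id in enumerate(stack)
    )


# Hypothetical request probabilities for four bins.
demand = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
hot_on_top = arrange_by_demand(demand, num_stacks=2)
hot_on_bottom = [list(reversed(stack)) for stack in hot_on_top]
print(expected_digs(hot_on_top, demand), expected_digs(hot_on_bottom, demand))
```

With the frequently requested bins on top, the expected digging per request is much lower than with the same bins buried, which is the effect the abstract relies on.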
Haoran Tang, Jieren Deng, Zhihong Pan, Hao Tian, Pratik Chaudhari, Xin Zhou
Diffusion-based methods have demonstrated remarkable capabilities in generating a diverse array of high-quality images, sparking interest in styled avatars, virtual try-on, and more. Previous methods use the same reference image as the target. An overlooked aspect is the leakage of the target's spatial information, style, etc. from the reference, harming the generated diversity and causing shortcuts. However, this approach continues as widely available datasets usually consist of single images not grouped by identities, and it is expensive to recollect large-scale same-identity data. Moreover, existing metrics adopt decoupled evaluation on text alignment and identity preservation, which fails to distinguish between balanced outputs and those that over-fit to one aspect. In this paper, we propose a multi-level, same-identity dataset RetriBooru, which groups anime characters by both face and cloth identities. RetriBooru enables adopting reference images of the same character and outfits as the target, while keeping flexible gestures and actions. We benchmark previous methods on our dataset, and demonstrate the effectiveness of training with a reference image different from the target (but of the same identity). We introduce a new concept composition task, where the conditioning encoder learns to retrieve different concepts from several reference images, and modify a baseline network, RetriNet, for the new task. Finally, we introduce a novel class of metrics named Similarity Weighted Diversity (SWD), to measure the overlooked diversity and better evaluate the alignment between similarity and diversity.
So Kuroki, Mai Nishimura, Tadashi Kozuno
Due to the complex interactions between agents, learning a multi-agent control
policy often requires a prohibitive amount of data. This paper aims to enable
multi-agent systems to effectively utilize past memories to adapt to novel
collaborative tasks in a data-efficient fashion. We propose the Multi-Agent
Coordination Skill Database, a repository for storing a collection of
coordinated behaviors associated with key vectors distinctive to them. Our
Transformer-based skill encoder effectively captures spatio-temporal
interactions that contribute to coordination and provides a unique skill
representation for each coordinated behavior. By leveraging only a small number
of demonstrations of the target task, the database enables us to train the
policy using a dataset augmented with the retrieved demonstrations.
Experimental evaluations demonstrate that our method achieves a significantly
higher success rate in push manipulation tasks compared with baseline methods
like few-shot imitation learning. Furthermore, we validate the effectiveness of
our retrieve-and-learn framework in a real environment using a team of wheeled
robots.
Authors' comments: Published in the 2024 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2024)
Anupam Purwar, Rahul Sundar
Retrieving answers quickly and at low cost, without hallucinations, from a combination of structured and unstructured data using language models is a major hurdle that prevents the use of language models in knowledge retrieval automation. The problem is accentuated when one wants to integrate a speech interface. Moreover, for commercial search and chatbot applications, complete reliance on commercial large language models (LLMs) such as GPT 3.5 can be very costly. In this work, the authors address this problem by first developing a keyword-based search framework that augments discovery of the context to be provided to the LLM. The keywords are generated by the LLM and cached for comparison with the keywords the LLM generates for each incoming query. This significantly reduces the time and cost of finding the context within documents. Once the context is set, the LLM uses it to provide answers based on a prompt tailored for Q&A. This work demonstrates that using keywords in context identification reduces the overall inference time and cost of information retrieval. Given this reduction in inference time and cost with the keyword-augmented retrieval framework, a speech-based interface for user input and response readout was integrated, allowing seamless interaction with the language model.
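The cached-keyword lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: `extract_keywords` is a hypothetical stand-in for the LLM keyword-generation call (here a simple stopword filter), and the document ids, texts, and overlap-count matching rule are invented for the example.

```python
import re
from typing import Dict, Optional, Set

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "are", "by", "do", "each", "for", "how", "i", "my", "of", "on", "the"}


def extract_keywords(text: str) -> Set[str]:
    """Hypothetical stand-in for the LLM call that generates keywords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word for word in words if word not in STOPWORDS}


def build_keyword_cache(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Generate keywords once per document and cache them by document id."""
    return {doc_id: extract_keywords(text) for doc_id, text in documents.items()}


def find_context(query: str, cache: Dict[str, Set[str]]) -> Optional[str]:
    """Return the id of the document whose cached keywords best match the query."""
    if not cache:
        return None
    query_keywords = extract_keywords(query)
    best = max(cache, key=lambda doc_id: len(cache[doc_id] & query_keywords))
    # No shared keyword means no usable context was found.
    return best if cache[best] & query_keywords else None


documents = {
    "billing": "Invoices are issued on the first day of each month.",
    "support": "Contact support by email for password resets.",
}
cache = build_keyword_cache(documents)
print(find_context("How do I reset my password?", cache))
```

Only the query needs keyword generation at request time; the document keywords are read from the cache, which is where the time and cost savings described in the abstract would come from.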
Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, Bill Byrne
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from existing knowledge bases to answer visually-grounded
questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong
framework to tackle KB-VQA, first retrieves related documents with Dense
Passage Retrieval (DPR) and then uses them to answer questions. This paper
proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which
significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major
limitations in RA-VQA's retriever: (1) the image representations obtained via
image-to-text transforms can be incomplete and inaccurate and (2) relevance
scores between queries and documents are computed with one-dimensional
embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes
these limitations by obtaining image representations that complement those from
the image-to-text transforms using a vision model aligned with an existing
text-based retriever through a simple alignment network. FLMR also encodes
images and questions using multi-dimensional embeddings to capture
finer-grained relevance between queries and documents. FLMR significantly
improves the original RA-VQA retriever's PRRecall@5 by approximately 8%.
Finally, we equipped RA-VQA with two state-of-the-art large
multi-modal/language models to achieve ~61% VQA score in the OK-VQA
dataset.
Authors' comments: To appear at NeurIPS 2023. This is a submission version, and the
camera-ready version will be updated soon
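The difference between single-vector scoring and late-interaction scoring mentioned in the abstract can be illustrated with toy vectors. This is a minimal sketch of ColBERT-style MaxSim scoring, not FLMR's implementation; the function names and the two-dimensional embeddings are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors: List[Vector]) -> Vector:
    """Collapse token embeddings into one vector (the single-vector setting)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def single_vector_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Relevance from one pooled embedding per query and per document."""
    return dot(mean_pool(query_tokens), mean_pool(doc_tokens))


def late_interaction_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """MaxSim: each query token takes its best-matching document token; sum the maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings: doc_a matches each query token exactly, doc_b only on average.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[0.5, 0.5], [0.5, 0.5]]

print(single_vector_score(query, doc_a), single_vector_score(query, doc_b))
print(late_interaction_score(query, doc_a), late_interaction_score(query, doc_b))
```

In this toy case the pooled scores cannot separate the two documents, while the token-level score prefers the document that matches each query token individually.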
Seongha Eom, Namgyu Ho, Jaehoon Oh, Se-Young Yun
Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability, namely image classification using novel text labels. Existing works have attempted to enhance CLIP by fine-tuning on downstream tasks, but these have inadvertently led to performance degradation on unseen classes, thus harming zero-shot generalization. This paper aims to address this challenge by leveraging readily available image-text pairs from an external dataset for cross-modal guidance during inference. To this end, we propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble. Given a query image, we harness the power of CLIP's cross-modal representations to retrieve relevant textual information from an external image-text pair dataset. Then, we assign higher weights to the more reliable modality between the original query image and retrieved text, contributing to the final prediction. X-MoRe demonstrates robust performance across a diverse set of tasks without the need for additional training, showcasing the effectiveness of utilizing cross-modal features to maximize CLIP's zero-shot ability.
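The two steps named above, cross-modal retrieval followed by a confidence-based ensemble, can be sketched as follows. This is an illustration under assumptions rather than X-MoRe itself: the embeddings are toy vectors instead of CLIP features, and using each modality's maximum class probability as its confidence is a hypothetical choice.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def retrieve_captions(query_vec: List[float],
                      pairs: List[Tuple[List[float], str]],
                      k: int) -> List[str]:
    """Return the captions paired with the k images most similar to the query."""
    ranked = sorted(pairs, key=lambda pair: cosine(query_vec, pair[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


def confidence_weighted_ensemble(image_probs: List[float],
                                 text_probs: List[float]) -> List[float]:
    """Mix two class distributions, weighting the more confident one higher.

    Confidence is taken to be the maximum class probability (an assumption).
    """
    conf_image, conf_text = max(image_probs), max(text_probs)
    weight_image = conf_image / (conf_image + conf_text)
    return [
        weight_image * p + (1.0 - weight_image) * q
        for p, q in zip(image_probs, text_probs)
    ]


# Hypothetical external image-text pairs and a query image embedding.
pairs = [
    ([1.0, 0.0], "a photo of a dog"),
    ([0.0, 1.0], "a photo of a cat"),
    ([0.9, 0.1], "a puppy on grass"),
]
print(retrieve_captions([1.0, 0.0], pairs, k=2))

# The image-based prediction is uncertain; the text-based one is confident.
print(confidence_weighted_ensemble([0.4, 0.35, 0.25], [0.1, 0.8, 0.1]))
```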
Nils Böhne, Mark Berger, Ronald van Velzen
Nowadays, one of the critical challenges in forensics is analyzing the
enormous amounts of unstructured digital evidence, such as images. Often,
unstructured digital evidence contains precious information for forensic
investigations. Therefore, a retrieval system that can effectively identify
forensically relevant images is paramount. In this work, we explored the
effectiveness of interactive learning in improving image retrieval performance
in the forensic domain by proposing Excalibur - a zero-shot cross-modal image
retrieval system extended with interactive learning. Excalibur was evaluated
using both simulations and a user study. The simulations reveal that
interactive learning is highly effective in improving retrieval performance in
the forensic domain. Furthermore, user study participants could effectively
leverage the power of interactive learning. Finally, they considered Excalibur
effective and straightforward to use and expressed interest in using it in
their daily practice.
Authors' comments: Submitted to the AAAI22 conference
Jiahao Zhang, Haiyang Zhang, Dongmei Zhang, Yong Liu, Shen Huang
Multi-hop QA involves finding multiple relevant passages and step-by-step
reasoning to answer complex questions. While previous approaches have developed
retrieval modules for selecting relevant passages, they face challenges in
scenarios beyond two hops, owing to the limited performance of one-step methods
and the failure of two-step methods when selecting irrelevant passages in
earlier stages. In this work, we introduce Beam Retrieval, a general end-to-end
retrieval framework for multi-hop QA. This approach maintains multiple partial
hypotheses of relevant passages at each step, expanding the search space and
reducing the risk of missing relevant passages. Moreover, Beam Retrieval
jointly optimizes an encoder and two classification heads by minimizing the
combined loss across all hops. To establish a complete QA system, we
incorporate a supervised reader or a zero-shot GPT-3.5. Experimental results
demonstrate that Beam Retrieval achieves a nearly 50% improvement compared with
baselines on challenging MuSiQue-Ans, and it also surpasses all previous
retrievers on HotpotQA and 2WikiMultiHopQA. Providing high-quality context,
Beam Retrieval helps our supervised reader achieve new state-of-the-art
performance and substantially improves (up to 28.8 points) the QA performance
of zero-shot GPT-3.5.
Authors' comments: Code is available at https://github.com/canghongjian/beam_retriever
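The idea of keeping several partial passage chains alive at each hop can be sketched as a generic beam search. This is not the released implementation: `score_fn` is a hypothetical stand-in for the model's learned scoring, and the scores in the table are invented to show a case where a greedy search (beam size 1) misses the best chain.

```python
from typing import Callable, List, Tuple

Chain = Tuple[int, ...]


def beam_retrieve(question: str,
                  num_passages: int,
                  score_fn: Callable[[str, Chain], float],
                  hops: int,
                  beam_size: int) -> List[Tuple[Chain, float]]:
    """Keep the `beam_size` best passage chains after every hop."""
    beams: List[Tuple[Chain, float]] = [((), 0.0)]
    for _ in range(hops):
        candidates: List[Tuple[Chain, float]] = []
        for chain, _ in beams:
            for idx in range(num_passages):
                if idx in chain:
                    continue  # a passage is used at most once per chain
                extended = chain + (idx,)
                candidates.append((extended, score_fn(question, extended)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


# Hypothetical chain scores: passage 0 looks best alone, but chain (1, 2) is best overall.
TOY_SCORES = {
    (0,): 0.6, (1,): 0.5, (2,): 0.1,
    (0, 1): 0.3, (0, 2): 0.4, (1, 0): 0.2, (1, 2): 0.9,
}


def toy_score(question: str, chain: Chain) -> float:
    return TOY_SCORES.get(chain, 0.0)


print(beam_retrieve("toy question", 3, toy_score, hops=2, beam_size=1)[0])
print(beam_retrieve("toy question", 3, toy_score, hops=2, beam_size=2)[0])
```

With a beam size of 1 the search commits to passage 0 and ends at chain (0, 2); with a beam size of 2 it keeps passage 1 as a second hypothesis and recovers the higher-scoring chain (1, 2).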
Qian Dong, Yiding Liu, Qingyao Ai, Haitao Li, Shuaiqiang Wang, Yiqun Liu, Dawei Yin, Shaoping Ma
Passage retrieval is a fundamental task in many information systems, such as
web search and question answering, where both efficiency and effectiveness are
critical concerns. In recent years, neural retrievers based on pre-trained
language models (PLM), such as dual-encoders, have achieved huge success. Yet,
studies have found that the performance of dual-encoders is often limited
because they neglect the interaction information between queries and candidate
passages. Therefore, various interaction paradigms have been proposed to
improve the performance of vanilla dual-encoders. Particularly, recent
state-of-the-art methods often introduce late-interaction during the model
inference process. However, such late-interaction based methods usually incur
extensive computation and storage costs on large corpora. Despite their
effectiveness, the concern of efficiency and space footprint is still an
important factor that limits the application of interaction-based neural
retrieval models. To tackle this issue, we incorporate implicit interaction
into dual-encoders, and propose I^3 retriever. In particular, our implicit
interaction paradigm leverages generated pseudo-queries to simulate
query-passage interaction, which is jointly optimized with the query and
passage encoders in an end-to-end manner. It can be fully pre-computed and
cached, and its inference process involves only a simple dot product of the
query vector and passage vector, which makes it as efficient as vanilla
dual-encoders. We conduct comprehensive experiments on MSMARCO and TREC2019 Deep
Learning Datasets, demonstrating the I^3 retriever's superiority in terms of
both effectiveness and efficiency. Moreover, the proposed implicit interaction
is compatible with special pre-training and knowledge distillation for passage
retrieval, which brings a new state-of-the-art performance.
Authors' comments: 10 pages
Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, Weizhu Chen
Large language models are powerful text processors and reasoners, but are
still subject to limitations including outdated knowledge and hallucinations,
which necessitates connecting them to the world. Retrieval-augmented large
language models have attracted extensive attention for grounding model generation
on external knowledge. However, retrievers struggle to capture relevance,
especially for queries with complex information needs. Recent work has proposed
to improve relevance modeling by having large language models actively involved
in retrieval, i.e., to improve retrieval with generation. In this paper, we
show that strong performance can be achieved by a method we call Iter-RetGen,
which synergizes retrieval and generation in an iterative manner. A model
output shows what might be needed to finish a task, and thus provides an
informative context for retrieving more relevant knowledge which in turn helps
generate a better output in the next iteration. Compared with recent work which
interleaves retrieval with generation when producing an output, Iter-RetGen
processes all retrieved knowledge as a whole and largely preserves the
flexibility in generation without structural constraints. We evaluate
Iter-RetGen on multi-hop question answering, fact verification, and commonsense
reasoning, and show that it can flexibly leverage parametric knowledge and
non-parametric knowledge, and is superior to or competitive with
state-of-the-art retrieval-augmented baselines while incurring less retrieval
and generation overhead. We can further improve performance via
generation-augmented retrieval adaptation.
Authors' comments: Preprint
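The iterative loop described above, where each output becomes context for the next retrieval, can be sketched as follows. This is a minimal illustration, not the authors' code: `toy_retrieve` and `toy_generate` are hypothetical stand-ins for a real retriever and language model, and forming the retrieval query by concatenating the question with the previous output is an assumption about the setup.

```python
from typing import Callable, List


def iter_retgen(question: str,
                retrieve: Callable[[str], List[str]],
                generate: Callable[[str, List[str]], str],
                iterations: int) -> str:
    """Alternate retrieval and generation; feed each output into the next query."""
    output = ""
    for _ in range(iterations):
        query = f"{question} {output}".strip()
        documents = retrieve(query)             # all retrieved text is used as a whole
        output = generate(question, documents)  # regenerate the full output each round
    return output


# Hypothetical keyword-indexed corpus for a two-hop question.
INDEX = {
    "inception": "Inception was directed by Christopher Nolan.",
    "nolan": "Christopher Nolan was born in London.",
}


def toy_retrieve(query: str) -> List[str]:
    return [doc for keyword, doc in INDEX.items() if keyword in query.lower()]


def toy_generate(question: str, documents: List[str]) -> str:
    # Prefer a document that answers the question; otherwise return an intermediate fact.
    for doc in documents:
        if "born in" in doc:
            return doc
    for doc in documents:
        if "directed by" in doc:
            return doc
    return "unknown"


question = "Where was the director of Inception born?"
print(iter_retgen(question, toy_retrieve, toy_generate, iterations=1))
print(iter_retgen(question, toy_retrieve, toy_generate, iterations=2))
```

In the toy run, the first iteration only surfaces the director's name; that output then lets the second retrieval reach the birthplace passage.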
Daniel Campos, ChengXiang Zhai
Vector-based retrieval systems have become a common staple for academic and industrial search applications because they provide a simple and scalable way of extending the search to leverage contextual representations for documents and queries. As these vector-based systems rely on contextual language models, their usage commonly requires GPUs, which can be expensive and difficult to manage. Given recent advances in introducing sparsity into language models for improved inference efficiency, in this paper, we study how sparse language models can be used for dense retrieval to improve inference efficiency. Using the popular retrieval library Tevatron and the MSMARCO, NQ, and TriviaQA datasets, we find that sparse language models can be used as direct replacements with little to no drop in accuracy and up to 4.3x improved inference speeds.
Gustavo Penha, Claudia Hauff
A number of learned sparse and dense retrieval approaches have recently been
proposed and proven effective in tasks such as passage retrieval and document
retrieval. In this paper we analyze with a replicability study if the lessons
learned generalize to the retrieval of responses for dialogues, an important
task for the increasingly popular field of conversational search. Unlike
passage and document retrieval where documents are usually longer than queries,
in response ranking for dialogues the queries (dialogue contexts) are often
longer than the documents (responses). Additionally, dialogues have a
particular structure, i.e. multiple utterances by different users. With these
differences in mind, we here evaluate how generalizable the following major
findings from previous works are: (F1) query expansion outperforms a
no-expansion baseline; (F2) document expansion outperforms a no-expansion
baseline; (F3) zero-shot dense retrieval underperforms sparse baselines; (F4)
dense retrieval outperforms sparse baselines; (F5) hard negative sampling is
better than random sampling for training dense models. Our experiments -- based
on three different information-seeking dialogue datasets -- reveal that four
out of five findings (F2-F5) generalize to our domain.
Authors' comments: Accepted for publication in the European Conference on Information
Retrieval (ECIR'23). arXiv admin note: substantial text overlap with
arXiv:2204.10558
Parishad BehnamGhader, Santiago Miret, Siva Reddy
Augmenting pretrained language models with retrievers has shown promise in
effectively solving common NLP problems, such as language modeling and question
answering. In this paper, we evaluate the strengths and weaknesses of popular
retriever-augmented language models, namely kNN-LM, REALM, DPR + FiD,
Contriever + ATLAS, and Contriever + Flan-T5, in reasoning over retrieved
statements across different tasks. Our findings indicate that the simple
similarity metric employed by retrievers is insufficient for retrieving all the
necessary statements for reasoning. Additionally, the language models do not
exhibit strong reasoning even when provided with only the required statements.
Furthermore, when combined with imperfect retrievers, the performance of the
language models becomes even worse, e.g., Flan-T5's performance drops by 28.6%
when retrieving 5 statements using Contriever. While larger language models
improve performance, there is still substantial room for improvement. Our
further analysis indicates that multihop retrieve-and-read is promising for
large language models like GPT-3.5, but does not generalize to other language
models like Flan-T5-xxl.
Authors' comments: Accepted in EMNLP2023 Findings
Zhengbao Jiang, Luyu Gao, Jun Araki, Haibo Ding, Zhiruo Wang, Jamie Callan, Graham Neubig
Systems for knowledge-intensive tasks such as open-domain question answering
(QA) usually consist of two stages: efficient retrieval of relevant documents
from a large corpus and detailed reading of the selected documents to generate
answers. Retrievers and readers are usually modeled separately, which
necessitates a cumbersome implementation and is hard to train and adapt in an
end-to-end fashion. In this paper, we revisit this design and eschew the
separate architecture and training in favor of a single Transformer that
performs Retrieval as Attention (ReAtt), and end-to-end training solely based
on supervision from the end QA task. We demonstrate for the first time that a
single model trained end-to-end can achieve both competitive retrieval and QA
performance, matching or slightly outperforming state-of-the-art separately
trained retrievers and readers. Moreover, end-to-end adaptation significantly
boosts its performance on out-of-domain datasets in both supervised and
unsupervised settings, making our model a simple and adaptable solution for
knowledge-intensive tasks. Code and models are available at
https://github.com/jzbjyb/ReAtt.
Authors' comments: EMNLP 2022
Dingkun Long, Yanzhao Zhang, Guangwei Xu, Pengjun Xie
Pre-trained language models (PTMs) have been shown to yield powerful text
representations for the dense passage retrieval task. Masked Language Modeling
(MLM) is a major sub-task of the pre-training process. However, we found that
the conventional random masking strategy tends to select a large number of
tokens that have limited effect on the passage retrieval task (e.g., stop-words
and punctuation). Noting that term importance weights can provide valuable
information for passage retrieval, we propose an alternative retrieval-oriented
masking (ROM) strategy in which more important tokens have a higher probability
of being masked out, so that this straightforward yet essential information
facilitates the language model pre-training process. Notably, the proposed
token masking method does not change the architecture or learning objective of
the original PTM. Our experiments verify that the proposed ROM enables term
importance information to help language model pre-training, thus achieving
better performance on multiple passage retrieval benchmarks.
Authors' comments: Search LM part of the "AliceMind SLM + HLAR" method in MS MARCO
Passage Ranking Leaderboard Submission
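Importance-weighted masking of the kind described above can be sketched as follows. The abstract does not give the exact formula, so the proportional scaling here is an assumption, as are the function names and the example weights; the sketch only shows tokens with higher term importance being masked with higher probability.

```python
import random
from typing import List


def masking_probabilities(weights: List[float], mask_rate: float = 0.15) -> List[float]:
    """Scale term-importance weights into per-token masking probabilities.

    Probabilities are proportional to the weights and chosen so that the
    expected masked fraction equals `mask_rate` (before capping at 1.0).
    """
    total = sum(weights)
    count = len(weights)
    if total == 0:
        return [mask_rate] * count  # fall back to uniform random masking
    return [min(1.0, mask_rate * count * weight / total) for weight in weights]


def importance_mask(tokens: List[str],
                    weights: List[float],
                    mask_rate: float = 0.15,
                    seed: int = 0,
                    mask_token: str = "[MASK]") -> List[str]:
    """Replace each token by `mask_token` with its importance-scaled probability."""
    rng = random.Random(seed)
    probabilities = masking_probabilities(weights, mask_rate)
    return [
        mask_token if rng.random() < probability else token
        for token, probability in zip(tokens, probabilities)
    ]


# Hypothetical weights: stop-words and punctuation get zero importance.
tokens = ["the", "capital", "of", "france", "is", "paris", "."]
weights = [0.0, 2.0, 0.0, 3.0, 0.0, 3.0, 0.0]
print(masking_probabilities(weights, mask_rate=0.3))
print(importance_mask(tokens, weights, mask_rate=0.3, seed=1))
```

The masking step only changes which tokens are selected, which is consistent with the abstract's note that the architecture and learning objective stay the same.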
Yury Zemlyanskiy, Michiel de Jong, Joshua Ainslie, Panupong Pasupat, Peter Shaw, Linlu Qiu, Sumit Sanghai, Fei Sha
A common recent approach to semantic parsing augments sequence-to-sequence
models by retrieving and appending a set of training samples, called exemplars.
The effectiveness of this recipe is limited by the ability to retrieve
informative exemplars that help produce the correct parse, which is especially
challenging in low-resource settings. Existing retrieval is commonly based on
similarity of query and exemplar inputs. We propose GandR, a retrieval
procedure that retrieves exemplars for which outputs are also similar.
GandR first generates a preliminary prediction with input-based retrieval. Then,
it retrieves exemplars with outputs similar to the preliminary prediction which
are used to generate a final prediction. GandR sets the state of the art on
multiple low-resource semantic parsing tasks.
Authors' comments: To appear in the proceedings of COLING 2022
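The two-stage procedure described above can be sketched as follows. This is a rough illustration rather than the authors' method: Jaccard token overlap stands in for the similarity measure, `copy_top` is a hypothetical stub for the sequence-to-sequence model, and the mixing weight `alpha` between input and output similarity is an assumption.

```python
import re
from typing import Callable, List, Tuple

Exemplar = Tuple[str, str]  # (input utterance, output parse)


def jaccard(a: str, b: str) -> float:
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def generate_and_retrieve(query: str,
                          train: List[Exemplar],
                          predict: Callable[[str, List[Exemplar]], str],
                          k: int = 2,
                          alpha: float = 0.5) -> Tuple[str, List[Exemplar]]:
    """Retrieve by input similarity, predict, then re-retrieve using the prediction."""
    # Stage 1: input-based retrieval and a preliminary prediction.
    first = sorted(train, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]
    preliminary = predict(query, first)

    # Stage 2: also reward exemplars whose outputs resemble the preliminary prediction.
    def relevance(ex: Exemplar) -> float:
        return (1.0 - alpha) * jaccard(query, ex[0]) + alpha * jaccard(preliminary, ex[1])

    second = sorted(train, key=relevance, reverse=True)[:k]
    return predict(query, second), second


def copy_top(query: str, exemplars: List[Exemplar]) -> str:
    """Hypothetical model stub: copy the output of the top-ranked exemplar."""
    return exemplars[0][1]


train = [
    ("play jazz music", "PlayMusic(genre=jazz)"),
    ("play the next music track", "SkipTrack()"),
    ("put on some rock", "PlayMusic(genre=rock)"),
]
prediction, exemplars = generate_and_retrieve("play some music", train, copy_top)
print(prediction)
print([source for source, _ in exemplars])
```

In this toy run, input-only retrieval picks an exemplar with a different intent ("play the next music track"); the second stage swaps it for one whose output resembles the preliminary prediction.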
Zhenghao Liu, Chenyan Xiong, Yuanhuiyi Lv, Zhiyuan Liu, Ge Yu
This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR),
which builds a unified model for multi-modal retrieval. UniVL-DR encodes
queries and multi-modality resources in an embedding space for searching
candidates from different modalities. To learn a unified embedding space for
multi-modal retrieval, UniVL-DR proposes two techniques: 1) Universal embedding
optimization strategy, which contrastively optimizes the embedding space using
the modality-balanced hard negatives; 2) Image verbalization method, which
bridges the modality gap between images and texts in the raw data space.
UniVL-DR achieves the state-of-the-art on the multi-modal open-domain question
answering benchmark, WebQA, and outperforms all retrieval models on the two
subtasks, text-text retrieval and text-image retrieval. This demonstrates that
it is feasible for universal multi-modal search to replace the
divide-and-conquer pipeline with a unified model, which also benefits
single/cross modality tasks. All source code of this work is available at
https://github.com/OpenMatch/UniVL-DR.
Authors' comments: Accepted by ICLR 2023
Sebastian Hofstätter, Nick Craswell, Bhaskar Mitra, Hamed Zamani, Allan Hanbury
Recently, several dense retrieval (DR) models have demonstrated performance competitive with the term-based retrieval methods that are ubiquitous in search systems. In contrast to term-based matching, DR projects queries and documents into a dense vector space and retrieves results via (approximate) nearest neighbor search. Deploying a new system, such as DR, inevitably involves tradeoffs in aspects of its performance. Established retrieval systems running at scale are usually well understood in terms of effectiveness and costs, such as query latency, indexing throughput, or storage requirements. In this work, we propose a framework with a set of criteria that go beyond simple effectiveness measures to thoroughly compare two retrieval systems with the explicit goal of assessing the readiness of one system to replace the other. This includes careful tradeoff considerations between effectiveness and various cost factors. Furthermore, we describe guardrail criteria, since even a system that is better on average may have systematic failures on a minority of queries. The guardrails check for failures on certain query characteristics and novel failure types that are only possible in dense retrieval systems. We demonstrate our decision framework on a Web ranking scenario. In that scenario, state-of-the-art DR models have surprisingly strong results, not only on average performance but also in passing an extensive set of guardrail tests, showing robustness on different query characteristics, lexical matching, generalization, and number of regressions. It is impossible to predict whether DR will become ubiquitous in the future, but one way this is possible is through repeated applications of decision processes such as the one presented here.
Dwaipayan Roy, Zeljko Carevic, Philipp Mayr
In this paper, we investigate the retrievability of datasets and publications
in a real-life Digital Library (DL). The measure of retrievability was
originally developed to quantify the influence that a retrieval system has on
the access to information. Retrievability can also enable DL engineers to
evaluate their search engine to determine the ease with which the content in
the collection can be accessed. Following this methodology, in our study, we
propose a system-oriented approach for studying dataset and publication
retrieval. A speciality of this paper is the focus on measuring the
accessibility biases of various types of DL items and including a metric of
usefulness. Among other metrics, we use Lorenz curves and Gini coefficients to
visualize the differences of the two retrievable document types (specifically
datasets and publications). Empirical results reported in the paper show a
distinguishable diversity in the retrievability scores among the documents of
different types.
Authors' comments: ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2022 research
paper
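The Lorenz curve and Gini coefficient mentioned above follow standard definitions and can be computed directly from a list of retrievability scores. The sketch below is generic, not the paper's code, and the example scores are invented.

```python
from typing import List, Sequence


def lorenz_curve(scores: Sequence[float]) -> List[float]:
    """Cumulative share of total retrievability, documents sorted ascending."""
    ordered = sorted(scores)
    total = sum(ordered)
    if total == 0:
        return [0.0] * (len(ordered) + 1)
    points = [0.0]
    running = 0.0
    for score in ordered:
        running += score
        points.append(running / total)
    return points


def gini(scores: Sequence[float]) -> float:
    """Gini coefficient: 0 means equal retrievability, values near 1 mean concentration."""
    ordered = sorted(scores)
    count = len(ordered)
    total = sum(ordered)
    if count == 0 or total == 0:
        return 0.0
    weighted = sum((2 * rank - count - 1) * score
                   for rank, score in enumerate(ordered, start=1))
    return weighted / (count * total)


# Hypothetical retrievability scores for two document types.
datasets = [0, 0, 0, 1]      # one item absorbs all retrievability
publications = [2, 2, 2, 2]  # perfectly even access
print(gini(datasets), gini(publications))
print(lorenz_curve([1, 1, 2]))
```

Comparing the coefficient across document types is one way to express the accessibility bias the abstract refers to: the further a Lorenz curve sags below the diagonal, the larger the Gini value.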
Wei Zhong, Jheng-Hong Yang, Yuqing Xie, Jimmy Lin
With the recent success of dense retrieval methods based on bi-encoders, studies have applied this approach to various interesting downstream retrieval tasks with good efficiency and in-domain effectiveness. Recently, we have also seen the presence of dense retrieval models in Math Information Retrieval (MIR) tasks, but the most effective systems remain classic retrieval methods that consider hand-crafted structure features. In this work, we try to combine the best of both worlds: a well-defined structure search method for effective formula search and efficient bi-encoder dense retrieval models to capture contextual similarities. Specifically, we have evaluated two representative bi-encoder models for token-level and passage-level dense retrieval on recent MIR tasks. Our results show that bi-encoder models are highly complementary to existing structure search methods, and we are able to advance the state-of-the-art on MIR datasets.