Stefano Campese, Ivano Lauriola, Alessandro Moschitti
An effective paradigm for building Automated Question Answering systems is the re-use of previously answered questions, e.g., for FAQs or forum applications. Given a database (DB) of question/answer (q/a) pairs, it is possible to answer a target question by scanning the DB for similar questions. In this paper, we scale this approach to the open domain, making it competitive with other standard methods, e.g., unstructured document or graph based. For this purpose, we (i) build a large scale DB of 6.3M q/a pairs, using public questions, (ii) design a new system based on neural IR and a q/a pair reranker, and (iii) construct training and test data to perform comparative experiments with our models. We demonstrate that Transformer-based models using (q,a) pairs outperform models only based on question representation, for both neural search and reranking. Additionally, we show that our DB-based approach is competitive with Web-based methods, i.e., a QA system built on top of the BING search engine, demonstrating the challenge of finding relevant information. Finally, we make our data and models available for future research.
Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo
Composed Image Retrieval (CIR) aims to retrieve a target image based on a
query composed of a reference image and a relative caption that describes the
difference between the two images. The high effort and cost required for
labeling datasets for CIR hamper the widespread usage of existing methods, as
they rely on supervised learning. In this work, we propose a new task,
Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled
training dataset. Our approach, named zero-Shot composEd imAge Retrieval with
textuaL invErsion (SEARLE), maps the visual features of the reference image
into a pseudo-word token in CLIP token embedding space and integrates it with
the relative caption. To support research on ZS-CIR, we introduce an
open-domain benchmarking dataset named Composed Image Retrieval on Common
Objects in context (CIRCO), which is the first dataset for CIR containing
multiple ground truths for each query. The experiments show that SEARLE
exhibits better performance than the baselines on the two main datasets for CIR
tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and
the model are publicly available at https://github.com/miccunifi/SEARLE.
Authors' comments: ICCV2023
Houxing Ren, Linjun Shou, Jian Pei, Ning Wu, Ming Gong, Daxin Jiang
Recent multilingual pre-trained models have shown better performance in
various multilingual tasks. However, these models perform poorly on
multilingual retrieval tasks due to a lack of multilingual training data. In this
paper, we propose to mine and generate self-supervised training data based on a
large-scale unlabeled corpus. We carefully design a mining method which
combines the sparse and dense models to mine the relevance of unlabeled queries
and passages. We also introduce a query generator to generate more queries in
target languages for unlabeled passages. Through extensive experiments on the
Mr. TYDI dataset and an industrial dataset from a commercial search engine, we
demonstrate that our method performs better than baselines based on various
pre-trained multilingual models. Our method even achieves on-par performance
with the supervised method on the latter dataset.
Authors' comments: EMNLP 2022 Findings
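The abstract above describes mining relevant query-passage pairs by combining a sparse and a dense model. The sketch below is only one possible reading of that idea, not the authors' method: it assumes a pair is kept as a pseudo-positive when a toy lexical scorer and a toy embedding scorer agree on the top-ranked passage. The token-overlap scorer stands in for a real sparse model such as BM25, the embedding vectors are assumed to be precomputed by some dense encoder, and all function names are hypothetical.

```python
import math


def sparse_score(query, passage):
    """Toy lexical score: number of shared lowercase tokens (stand-in for a sparse model)."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pseudo_positives(queries, passages, query_vecs, passage_vecs):
    """Return (query_index, passage_index) pairs on which both scorers agree on top-1.

    Assumption: agreement between the two scorers is used as the mining rule.
    """
    mined = []
    indices = range(len(passages))
    for qi, query in enumerate(queries):
        best_sparse = max(indices, key=lambda pi: sparse_score(query, passages[pi]))
        best_dense = max(indices, key=lambda pi: cosine(query_vecs[qi], passage_vecs[pi]))
        if best_sparse == best_dense:
            mined.append((qi, best_sparse))
    return mined
```

Pairs on which the two scorers disagree are simply dropped here; a real pipeline might instead treat them as uncertain or as hard negatives.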
Daniel Nakhimovich, Yinglong Miao, Kostas E. Bekris
This work proposes a robot task planning framework for retrieving a target
object in a confined workspace among multiple stacked objects that obstruct the
target. The robot can use prehensile picking and in-workspace placing actions.
The method assumes access to 3D models for the visible objects in the scene.
The key contribution is in achieving desirable properties, i.e., to provide (a)
safety, by avoiding collisions with sensed obstacles, objects, and occluded
regions, and (b) resolution completeness (RC) - or probabilistic completeness
(PC) depending on implementation - which indicates a solution will be
eventually found (if it exists) as the resolution of algorithmic parameters
increases. A heuristic variant of the basic RC algorithm is also proposed to
solve the task more efficiently while retaining the desirable properties.
Simulation results compare using random picking and placing operations against
the basic RC algorithm that reasons about object dependency as well as its
heuristic variant. The success rate is higher for the RC approaches given the
same amount of time. The heuristic variant is able to solve the problem even
more efficiently than the basic approach. The integration of the RC algorithm
with perception, where an RGB-D sensor detects the objects as they are being
moved, enables real robot demonstrations of safely retrieving target objects
from a cluttered shelf.
Authors' comments: 7 pages, 4 figures, Accepted to IEEE International Conference on
Robotics and Automation (ICRA) 2023
WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, Jae-Pil Heo
Recently, video moment retrieval and highlight detection (MR/HD) have been
spotlighted as the demand for video understanding has drastically increased. The
key objective of MR/HD is to localize the moment and estimate clip-wise
accordance level, i.e., saliency score, to the given text query. Although the
recent transformer-based models brought some advances, we found that these
methods do not fully exploit the information of a given query. For example, the
relevance between text query and video contents is sometimes neglected when
predicting the moment and its saliency. To tackle this issue, we introduce
Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. As
we observe the insignificant role of a given query in transformer
architectures, our encoding module starts with cross-attention layers to
explicitly inject the context of text query into video representation. Then, to
enhance the model's capability of exploiting the query information, we
manipulate the video-query pairs to produce irrelevant pairs. Such negative
(irrelevant) video-query pairs are trained to yield low saliency scores, which,
in turn, encourages the model to estimate precise accordance between
query-video pairs. Lastly, we present an input-adaptive saliency predictor
which adaptively defines the criterion of saliency scores for the given
video-query pairs. Our extensive studies verify the importance of building the
query-dependent representation for MR/HD. Specifically, QD-DETR outperforms
state-of-the-art methods on QVHighlights, TVSum, and Charades-STA datasets.
Code is available at github.com/wjun0830/QD-DETR.
Authors' comments: Accepted to CVPR 2023. Code is available at
https://github.com/wjun0830/QD-DETR
Vaishali Pal, Carlos Lassance, Hervé Déjean, Stéphane Clinchant
Parameter-efficient transfer learning with Adapters has been studied in
Natural Language Processing (NLP) as an alternative to full fine-tuning.
Adapters are memory-efficient and scale well with downstream tasks by training
small bottle-neck layers added between transformer layers while keeping the
large pretrained language models (PLMs) frozen. In spite of showing promising
results in NLP, these methods are under-explored in Information Retrieval.
While previous studies have only experimented with dense retrievers or in a
cross lingual retrieval scenario, in this paper we aim to complete the picture
on the use of adapters in IR. First, we study adapters for SPLADE, a sparse
retriever, for which adapters not only retain the efficiency and effectiveness
otherwise achieved by finetuning, but are memory-efficient and orders of
magnitude lighter to train. We observe that Adapters-SPLADE not only optimizes
just 2% of training parameters, but also outperforms its fully fine-tuned counterpart
and existing parameter-efficient dense IR models on IR benchmark datasets.
Secondly, we address domain adaptation of neural retrieval thanks to adapters
on cross-domain BEIR datasets and TripClick. Finally, we also consider
knowledge sharing between rerankers and first stage rankers. Overall, our study
completes the examination of adapters for neural IR.
Authors' comments: accepted at ECIR'23
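The bottleneck layers mentioned in the abstract above can be illustrated with a minimal sketch. This is not the Adapters-SPLADE implementation: it assumes a common adapter layout (down-projection, ReLU, up-projection, residual connection), omits biases, and zero-initialises the up-projection so the adapter starts as an identity function. The class name and parameters are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Small trainable layer inserted after a frozen transformer sub-layer (sketch)."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        # Down-projection (bottleneck_size x hidden_size), small random weights.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                     for _ in range(bottleneck_size)]
        # Up-projection (hidden_size x bottleneck_size), zero-initialised (an assumption)
        # so that the adapter initially leaves the hidden state unchanged.
        self.up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

    def num_parameters(self):
        """Count the adapter weights; only these would be trained."""
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])

    def forward(self, hidden):
        squeezed = [max(0.0, z) for z in matvec(self.down, hidden)]  # ReLU
        expanded = matvec(self.up, squeezed)
        return [h + e for h, e in zip(hidden, expanded)]  # residual connection
```

With a hidden size of 8 and a bottleneck size of 2, the adapter holds 32 weights, against 64 for a single dense 8x8 layer. The gap widens as the hidden size grows relative to the bottleneck.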
Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou et al.
The task of repository-level code completion is to continue writing the
unfinished code based on a broader context of the repository. However, it is
difficult for automated code completion tools to utilize the useful
information scattered in different files. We propose RepoCoder, a simple,
generic, and effective framework to address the challenge. It streamlines the
repository-level code completion process by incorporating a similarity-based
retriever and a pre-trained code language model in an iterative
retrieval-generation pipeline. RepoCoder makes effective utilization of
repository-level information for code completion and has the ability to
generate code at various levels of granularity. Moreover, we propose a new
benchmark RepoEval, which consists of the latest and high-quality real-world
repositories covering line, API invocation, and function body completion
scenarios. Experimental results indicate that RepoCoder significantly improves
the In-File completion baseline by over 10% in all settings and consistently
outperforms the vanilla retrieval-augmented code completion approach.
Furthermore, we validate the effectiveness of RepoCoder through comprehensive
analysis, providing valuable insights for future research. Our source code and
benchmark are publicly available:
https://github.com/microsoft/CodeT/tree/main/RepoCoder
Authors' comments: accepted by EMNLP 2023 main conference
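The iterative retrieval-generation pipeline described in the abstract above can be sketched as a short loop. This is an illustrative sketch under stated assumptions, not the RepoCoder code: Jaccard similarity over whitespace tokens stands in for the similarity-based retriever, `generate` is a hypothetical callable standing in for the pre-trained code language model, and the previous completion is assumed to be appended to the retrieval query in the next round.

```python
def jaccard(a, b):
    """Jaccard similarity between the whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def retrieve(query, snippets, top_k=1):
    """Return the top_k repository snippets most similar to the query."""
    return sorted(snippets, key=lambda s: jaccard(query, s), reverse=True)[:top_k]


def iterative_completion(unfinished_code, snippets, generate, iterations=2, top_k=1):
    """Alternate retrieval and generation for a fixed number of rounds.

    generate(context, prefix) is a hypothetical model call that returns a completion
    string given the retrieved snippets and the unfinished code.
    """
    query = unfinished_code
    completion = ""
    for _ in range(iterations):
        context = retrieve(query, snippets, top_k)
        completion = generate(context, unfinished_code)
        # Assumption: the next retrieval query also includes the latest completion.
        query = unfinished_code + "\n" + completion
    return completion
```

The point of the second round is that the draft completion may share more tokens with the relevant repository code than the unfinished prefix alone does.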
Gustavo Penha, Enrico Palumbo, Maryam Aziz, Alice Wang, Hugues Bouchard
An important goal of online platforms is to enable content discovery, i.e.
allow users to find a catalog entity they were not familiar with. A
pre-requisite to discover an entity, e.g. a book, with a search engine is that
the entity is retrievable, i.e. there are queries for which the system will
surface such entity in the top results. However, machine-learned search engines
have a high retrievability bias, where the majority of the queries return the
same entities. This happens partly due to the predominance of narrow intent
queries, where users create queries using the title of an already known entity,
e.g. in book search 'harry potter'. The amount of broad queries where users
want to discover new entities, e.g. in music search 'chill lyrical electronica
with an atmospheric feeling to it', and have a higher tolerance to what they
might find, is small in comparison. We focus here on two factors that have a
negative impact on the retrievability of the entities: (I) the training data
used for dense retrieval models and (II) the distribution of narrow and broad
intent queries issued in the system. We propose CtrlQGen, a method that
generates queries for a chosen underlying intent, narrow or broad. We can use
CtrlQGen to improve factor (I) by generating training data for dense retrieval
models comprised of diverse synthetic queries. CtrlQGen can also be used to
deal with factor (II) by suggesting queries with broader intents to users. Our
results on datasets from the domains of music, podcasts, and books reveal that
we can significantly decrease the retrievability bias of a dense retrieval
model when using CtrlQGen. First, by using the generated queries as training
data for dense models we make 9% of the entities retrievable (go from zero to
non-zero retrievability). Second, by suggesting broader queries to users, we
can make 12% of the entities retrievable in the best case.
Authors' comments: Accepted for publication in the International World Wide Web
Conference 2023
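The notion of retrievability used in the abstract above (an entity is retrievable if some query surfaces it in the top results) can be made concrete with a small sketch. The definitions below are assumptions for illustration: an entity's retrievability is the number of queries whose top-k results contain it, and the function names and the `rankings` input format are hypothetical.

```python
from collections import Counter


def retrievability(rankings, top_k):
    """Count, for each entity, how many queries return it within the top_k results.

    rankings maps each query string to its ranked list of entity ids.
    """
    counts = Counter()
    for ranked_entities in rankings.values():
        counts.update(set(ranked_entities[:top_k]))
    return counts


def share_retrievable(rankings, catalog, top_k):
    """Fraction of catalog entities with non-zero retrievability."""
    counts = retrievability(rankings, top_k)
    return sum(1 for entity in catalog if counts[entity] > 0) / len(catalog)
```

Comparing `share_retrievable` before and after adding new queries gives the kind of "zero to non-zero retrievability" figure the abstract reports.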
Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun
This paper proposes a novel diffusion-based model, CompoDiff, for solving
zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. This paper
also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8
million reference images, conditions, and corresponding target image triplets
to train CIR models. CompoDiff and SynthTriplets18M tackle the shortages of the
previous CIR approaches, such as poor generalizability due to the small dataset
scale and the limited types of conditions. CompoDiff not only achieves a new
state-of-the-art on four ZS-CIR benchmarks, including FashionIQ, CIRR, CIRCO,
and GeneCIS, but also enables a more versatile and controllable CIR by
accepting various conditions, such as negative text, and image mask conditions.
CompoDiff also shows the controllability of the condition strength between text
and image queries and the trade-off between inference speed and performance,
which are unavailable with existing CIR methods. The code and dataset are
available at https://github.com/navervision/CompoDiff
Authors' comments: TMLR camera-ready; First two authors contributed equally; TMLR Expert
Certification; 30 pages, 5.9MB
Guoliang Wang, Yanlei Shang, Yong Chen
A critical challenge to image-text retrieval is how to learn accurate correspondences between images and texts. Most existing methods mainly focus on coarse-grained correspondences based on co-occurrences of semantic objects, while failing to distinguish the fine-grained local correspondences. In this paper, we propose a novel Scene Graph based Fusion Network (dubbed SGFN), which enhances the images'/texts' features through intra- and cross-modal fusion for image-text retrieval. To be specific, we design an intra-modal hierarchical attention fusion to incorporate semantic contexts, such as objects, attributes, and relationships, into images'/texts' feature vectors via scene graphs, and a cross-modal attention fusion to combine the contextual semantics and local fusion via contextual vectors. Extensive experiments on public datasets Flickr30K and MSCOCO show that our SGFN performs better than quite a few SOTA image-text retrieval methods.
Li Yi
Generating lyrics and poems is one of the essential downstream tasks in the Natural Language Processing (NLP) field. Current methods have performed well in some lyrics generation scenarios but need further improvements in tasks requiring fine control. We propose a novel method for generating ancient Chinese lyrics (Song Ci), a type of ancient lyrics that involves precise control of song structure. The system is equipped with a phrase retriever and a phrase connector. Based on an input prompt, the phrase retriever picks phrases from a database to construct a phrase pool. The phrase connector then selects a series of phrases from the phrase pool that minimizes a multi-term loss function that considers rhyme, song structure, and fluency. Experimental results show that our method can generate high-quality ancient Chinese lyrics while performing well on topic and song structure control. We also expect our approach to be generalized to other lyrics-generating tasks.
Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo et al.
As Large Language Models (LLMs) become popular, an important trend has emerged of using multimodality to augment the LLMs' generation ability, which enables LLMs to better interact with the world. However, there is no unified understanding of at which stage and how to incorporate different modalities. In this survey, we review methods that assist and augment generative models by retrieving multimodal knowledge, whose formats range from images, code, tables, and graphs to audio. Such methods offer a promising solution to important concerns such as factuality, reasoning, interpretability, and robustness. By providing an in-depth review, this survey is expected to provide scholars with a deeper understanding of the methods' applications and encourage them to adapt existing techniques to the fast-growing field of LLMs.
Liang Yan, Shengzhong Zhang, Bisheng Li, Min Zhou, Zengfeng Huang
Extremely skewed label distributions are common in real-world node classification tasks. If not dealt with appropriately, they significantly hurt the performance of GNNs on minority classes. Due to its practical importance, a series of recent studies has been devoted to this challenge. Existing over-sampling techniques smooth the label distribution by generating "fake" minority nodes and synthesizing their features and local topology, which largely ignores the rich information of unlabeled nodes on graphs. In this paper, we propose UNREAL, an iterative over-sampling method. The first key difference is that we only add unlabeled nodes instead of synthetic nodes, which eliminates the challenge of feature and neighborhood generation. To select which unlabeled nodes to add, we propose geometric ranking to rank unlabeled nodes. Geometric ranking exploits unsupervised learning in the node embedding space to effectively calibrate pseudo-label assignment. Finally, we identify the issue of geometric imbalance in the embedding space and provide a simple metric to filter out geometrically imbalanced nodes. Extensive experiments on real-world benchmark datasets are conducted, and the empirical results show that our method significantly and consistently outperforms current state-of-the-art methods across different datasets with different imbalance ratios.
SeungHeon Doh, Minz Won, Keunwoo Choi, Juhan Nam
We introduce a framework that recommends music based on the emotions of
speech. In content creation and daily life, speech contains information about
human emotions, which can be enhanced by music. Our framework focuses on a
cross-domain retrieval system to bridge the gap between speech and music via
emotion labels. We explore different speech representations and report their
impact on different speech types, including acting voice and wake-up words. We
also propose an emotion similarity regularization term in cross-domain
retrieval tasks. By incorporating the regularization term into training,
similar speech-and-music pairs in the emotion space are closer in the joint
embedding space. Our comprehensive experimental results show that the proposed
model is effective in textless speech-to-music retrieval.
Authors' comments: To Appear IEEE ICASSP 2023
Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen
Existing text-video retrieval solutions are, in essence, discriminant models
focused on maximizing the conditional likelihood, i.e., p(candidates|query).
While straightforward, this de facto paradigm overlooks the underlying data
distribution p(query), which makes it challenging to identify
out-of-distribution data. To address this limitation, we creatively tackle this
task from a generative viewpoint and model the correlation between the text and
the video as their joint probability p(candidates,query). This is accomplished
through a diffusion-based text-video retrieval framework (DiffusionRet), which
models the retrieval task as a process of gradually generating joint
distribution from noise. During training, DiffusionRet is optimized from both
the generation and discrimination perspectives, with the generator being
optimized by generation loss and the feature extractor trained with contrastive
loss. In this way, DiffusionRet cleverly leverages the strengths of both
generative and discriminative methods. Extensive experiments on five commonly
used text-video retrieval benchmarks, including MSRVTT, LSMDC, MSVD,
ActivityNet Captions, and DiDeMo, with superior performances, justify the
efficacy of our method. More encouragingly, without any modification,
DiffusionRet even performs well in out-domain retrieval settings. We believe
this work brings fundamental insights into the related fields. Code is
available at https://github.com/jpthu17/DiffusionRet.
Authors' comments: Accepted by ICCV 2023
Valerio La Gatta, Chiyu Wei, Luca Luceri, Francesco Pierri, Emilio Ferrara
Nowadays, false and unverified information on social media sways individuals'
perceptions during major geo-political events and threatens the quality of the
whole digital information ecosystem. Since the Russian invasion of Ukraine,
several fact-checking organizations have been actively involved in verifying
stories related to the conflict that circulated online. In this paper, we
leverage a public repository of fact-checked claims to build a methodological
framework for automatically identifying false and unsubstantiated claims
spreading on Twitter in February 2022. Our framework consists of two sequential
models: First, the claim detection model identifies whether tweets incorporate
a (false) claim among those considered in our collection. Then, the claim
retrieval model matches the tweets with fact-checked information by ranking
verified claims according to their relevance with the input tweet. Both models
are based on pre-trained language models and fine-tuned to perform a text
classification task and an information retrieval task, respectively. In
particular, to validate the effectiveness of our methodology, we consider 83
verified false claims that spread on Twitter during the first week of the
invasion, and manually annotate 5,872 tweets according to the claim(s) they
report. Our experiments show that our proposed methodology outperforms standard
baselines for both claim detection and claim retrieval. Overall, our results
highlight how social media providers could effectively leverage semi-automated
approaches to identify, track, and eventually moderate false information that
spreads on their platforms.
Authors' comments: 7 pages, 2 figures, WWW23 Companion Proceedings
Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski
The task of Composed Image Retrieval (CoIR) involves queries that combine
image and text modalities, allowing users to express their intent more
effectively. However, current CoIR datasets are orders of magnitude smaller
compared to other vision and language (V&L) datasets. Additionally, some of
these datasets have noticeable issues, such as queries containing redundant
modalities. To address these shortcomings, we introduce the Large Scale
Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset which is ten times
larger than existing ones. Pre-training on LaSCo shows a noteworthy
improvement in performance, even in zero-shot. Furthermore, we propose a new
approach for analyzing CoIR datasets and methods, which detects modality
redundancy or necessity in queries. We also introduce a new CoIR baseline, the
Cross-Attention driven Shift Encoder (CASE). This baseline allows for early
fusion of modalities using a cross-attention module and employs an additional
auxiliary task during training. Our experiments demonstrate that this new
baseline outperforms the current state-of-the-art methods on established
benchmarks like FashionIQ and CIRR.
Authors' comments: Camera Ready version for AAAI 2024
Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei et al.
Large Language Models (LLMs) are popular for their impressive abilities, but
the need for model-specific fine-tuning or task-specific prompt engineering can
hinder their generalization. We propose UPRISE (Universal Prompt Retrieval for
Improving zero-Shot Evaluation), which tunes a lightweight and versatile
retriever that automatically retrieves prompts for a given zero-shot task
input. Specifically, we demonstrate universality in a cross-task and
cross-model scenario: the retriever is tuned on a diverse set of tasks, but
tested on unseen task types; we use a small frozen LLM, GPT-Neo-2.7B, for
tuning the retriever, but test the retriever on different LLMs of much larger
scales, such as BLOOM-7.1B, OPT-66B and GPT3-175B. Additionally, we show that
UPRISE mitigates the hallucination problem in our experiments with ChatGPT,
suggesting its potential to improve even the strongest LLMs. Our model and code
are available at https://github.com/microsoft/LMOps.
Authors' comments: EMNLP 2023 Main Conference
Won Jo, Geuntaek Lim, Gwangjin Lee, Hyunwoo Kim, Byungsoo Ko, Yukyung Choi
In content-based video retrieval (CBVR), dealing with large-scale
collections, efficiency is as important as accuracy; thus, several video-level
feature-based studies have actively been conducted. Nevertheless, owing to the
severe difficulty of embedding a lengthy and untrimmed video into a single
feature, these studies have been insufficient for accurate retrieval compared
to frame-level feature-based studies. In this paper, we show that appropriate
suppression of irrelevant frames can provide insight into the current obstacles
of the video-level approaches. Furthermore, we propose a Video-to-Video
Suppression network (VVS) as a solution. VVS is an end-to-end framework that
consists of an easy distractor elimination stage to identify which frames to
remove and a suppression weight generation stage to determine the extent to
suppress the remaining frames. This structure is intended to effectively
describe an untrimmed video with varying content and meaningless information.
Its efficacy is proved via extensive experiments, and we show that our approach
is not only state-of-the-art in video-level approaches but also has a fast
inference time despite possessing retrieval capabilities close to those of
frame-level approaches. Code is available at https://github.com/sejong-rcv/VVS
Authors' comments: AAAI-24
Min Cao, Yang Bai, Jingyao Wang, Ziqiang Cao, Liqiang Nie, Min Zhang
Despite rapid gains in performance, current image-text
retrieval methods suffer from $N$-related time complexity, which hinders their
application in practice. To improve efficiency, this paper
presents a simple and effective keyword-guided pre-screening framework for
image-text retrieval. Specifically, we convert the image and text data into
keywords and perform keyword matching across modalities to exclude a large
number of irrelevant gallery samples prior to the retrieval network. For the
keyword prediction, we cast it as a multi-label classification problem
and propose a multi-task learning scheme by appending the multi-label
classifiers to the image-text retrieval network to achieve a lightweight and
high-performance keyword prediction. For the keyword matching, we introduce the
inverted index in the search engine and create a win-win situation on both time
and space complexities for the pre-screening. Extensive experiments on two
widely-used datasets, i.e., Flickr30K and MS-COCO, verify the effectiveness of
the proposed framework. The proposed framework equipped with only two embedding
layers achieves $O(1)$ querying time complexity, while improving the retrieval
efficiency and keeping its performance, when applied prior to the common
image-text retrieval methods. Our code will be released.
Authors' comments: 11 pages, 7 figures, 6 tables
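The keyword matching step in the abstract above relies on an inverted index. The sketch below shows the general data structure only, not the authors' framework: keywords are assumed to be already predicted for each gallery item and for the query, a gallery item is assumed to survive pre-screening if it shares at least one keyword with the query, and the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(gallery_keywords):
    """Map each keyword to the set of gallery item ids tagged with it.

    gallery_keywords maps an item id to an iterable of its keywords.
    """
    index = defaultdict(set)
    for item_id, keywords in gallery_keywords.items():
        for keyword in keywords:
            index[keyword].add(item_id)
    return index


def prescreen(index, query_keywords):
    """Return gallery items that share at least one keyword with the query."""
    candidates = set()
    for keyword in query_keywords:
        # One dictionary lookup per query keyword, independent of gallery size.
        candidates |= index.get(keyword, set())
    return candidates
```

Only the returned candidates would then be passed to the (more expensive) retrieval network; items with no matching keyword are excluded up front.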