Steven Ndung'u, Trienko Grobler, Stefan J. Wijnholds, Dimka Karastoyanova, George Azzopardi
The sheer number of sources that will be detected by next-generation radio
surveys will be astronomical, which will result in serendipitous discoveries.
Data-dependent deep hashing algorithms have been shown to be efficient at image
retrieval tasks in the fields of computer vision and multimedia. However, there
are limited applications of these methodologies in the field of astronomy. In
this work, we utilize deep hashing to rapidly search for similar images in a
large database. The experiment uses a balanced dataset of 2708 samples
consisting of four classes: Compact, FRI, FRII, and Bent. The performance of
the method was evaluated using the mean average precision (mAP) metric where a
precision of 88.5% was achieved. The experimental results demonstrate the
capability to search and retrieve similar radio images efficiently and at
scale. The retrieval is based on the Hamming distance between the binary hash
of the query image and those of the reference images in the database.
Authors' comments: 4 pages, 4 figures
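The Hamming-distance retrieval step described in the abstract above can be illustrated with a minimal sketch. It assumes the binary hash codes have already been produced by some trained hashing network; the function name `hamming_rank` and the toy 8-bit codes are hypothetical and not taken from the paper.

```python
import numpy as np

def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Return database indices sorted by Hamming distance to the query code."""
    # Hamming distance = number of bit positions where the two codes differ.
    distances = np.count_nonzero(db_codes != query_code, axis=1)
    # A stable sort keeps the original database order among ties.
    return np.argsort(distances, kind="stable")

# Toy 8-bit codes (made-up values, for illustration only).
db_codes = np.array([
    [0, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 1, 1, 0],
], dtype=np.uint8)
query_code = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=np.uint8)

ranking = hamming_rank(query_code, db_codes)
print(ranking)  # nearest reference images first
```

Because the comparison works on short binary codes rather than floating-point feature vectors, this kind of lookup stays cheap even for large reference databases.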
Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, Yang Liu
Accurate video moment retrieval (VMR) requires universal visual-textual
correlations that can handle unknown vocabulary and unseen scenes. However, the
learned correlations are likely either biased when derived from a limited
amount of moment-text data which is hard to scale up because of the prohibitive
annotation cost (fully-supervised), or unreliable when only the video-text
pairwise relationships are available without fine-grained temporal annotations
(weakly-supervised). Recently, vision-language models (VLM) have demonstrated a
new transfer learning paradigm to benefit different vision tasks through the
universal visual-textual correlations derived from large-scale vision-language
pairwise web data, which has also shown benefits to VMR by fine-tuning in the
target domains. In this work, we propose a zero-shot method for adapting
generalisable visual-textual priors from arbitrary VLM to facilitate
moment-text alignment, without the need for accessing the VMR data. To this
end, we devise a conditional feature refinement module to generate
boundary-aware visual features conditioned on text queries to enable better
moment boundary understanding. Additionally, we design a bottom-up proposal
generation strategy that mitigates the impact of domain discrepancies and
breaks down complex-query retrieval tasks into individual action retrievals,
thereby maximizing the benefits of VLM. Extensive experiments conducted on
three VMR benchmark datasets demonstrate the notable performance advantages of
our zero-shot algorithm, especially in the novel-word and novel-location
out-of-distribution setups.
Authors' comments: Accepted by WACV 2024
Hui Wang, Shiwan Zhao, Xiguang Zheng, Yong Qin
Automatic Mean Opinion Score (MOS) prediction is crucial to evaluate the
perceptual quality of the synthetic speech. While recent approaches using
pre-trained self-supervised learning (SSL) models have shown promising results,
they only partly address the data scarcity issue for the feature extractor.
This leaves the data scarcity issue for the decoder unresolved, leading to
suboptimal performance. To address this challenge, we propose a
retrieval-augmented MOS prediction method, dubbed RAMP, to enhance the
decoder's ability against the data scarcity issue. A fusing network is also
proposed to dynamically adjust the retrieval scope for each instance and the
fusion weights based on the predictive confidence. Experimental results show
that our proposed method outperforms the existing methods in multiple
scenarios.
Authors' comments: Accepted by Interspeech 2023, oral
Jiangui Chen, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Yixing Fan, Xueqi Cheng
Generative retrieval (GR) directly predicts the identifiers of relevant
documents (i.e., docids) based on a parametric model. It has achieved solid
performance on many ad-hoc retrieval tasks. So far, these tasks have assumed a
static document collection. In many practical scenarios, however, document
collections are dynamic, where new documents are continuously added to the
corpus. The ability to incrementally index new documents while preserving the
ability to answer queries with both previously and newly indexed relevant
documents is vital to applying GR models. In this paper, we address this
practical continual learning problem for GR. We put forward a novel
Continual-LEarner for generatiVE Retrieval (CLEVER) model and make two major
contributions to continual learning for GR: (i) To encode new documents into
docids with low computational cost, we present Incremental Product
Quantization, which updates a partial quantization codebook according to two
adaptive thresholds; and (ii) To memorize new documents for querying without
forgetting previous knowledge, we propose a memory-augmented learning
mechanism, to form meaningful connections between old and new documents.
Empirical results demonstrate the effectiveness and efficiency of the proposed
model.
Authors' comments: Accepted by CIKM 2023
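To make the idea of encoding documents into docids via product quantization more concrete, the following is a minimal sketch of plain (non-incremental) product quantization. It is not the Incremental Product Quantization proposed in the paper: the adaptive thresholds and partial codebook updates are omitted, and the codebook values, shapes, and the name `pq_encode` are hypothetical.

```python
import numpy as np

def pq_encode(vector: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Encode a vector as M centroid indices, one per sub-codebook.

    codebooks has shape (M, K, d_sub): M sub-spaces, K centroids each.
    """
    M, K, d_sub = codebooks.shape
    # Split the document vector into M contiguous sub-vectors.
    subvectors = vector.reshape(M, d_sub)
    # Squared distance from each sub-vector to every centroid in its sub-codebook.
    dists = ((codebooks - subvectors[:, None, :]) ** 2).sum(axis=2)
    # The sequence of nearest-centroid indices serves as the identifier.
    return dists.argmin(axis=1)

# Toy codebooks: 2 sub-spaces, 2 centroids each, 2 dimensions per sub-space.
codebooks = np.array([
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 1.0]],
])
doc_vector = np.array([0.9, 1.1, 0.1, -0.2])

docid = pq_encode(doc_vector, codebooks)
print(docid)
```

In this plain form the codebooks are fixed; handling newly arriving documents whose vectors fall far from every centroid is the part that an incremental scheme would have to address.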
Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol
Composed Image Retrieval (CoIR) has recently gained popularity as a task that
considers both text and image queries together, to search for relevant images
in a database. Most CoIR approaches require manually annotated datasets,
comprising image-text-image triplets, where the text describes a modification
from the query image to the target image. However, manual curation of CoIR
triplets is expensive and prevents scalability. In this work, we instead
propose a scalable automatic dataset creation methodology that generates
triplets given video-caption pairs, while also expanding the scope of the task
to include composed video retrieval (CoVR). To this end, we mine paired videos
with a similar caption from a large database, and leverage a large language
model to generate the corresponding modification text. Applying this
methodology to the extensive WebVid2M collection, we automatically construct
our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we
introduce a new benchmark for CoVR with a manually annotated evaluation set,
along with baseline results. We further validate that our methodology is
equally applicable to image-caption pairs, by generating 3.3 million CoIR
training triplets using the Conceptual Captions dataset. Our model builds on
BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and
incorporates an additional caption retrieval loss to exploit extra supervision
beyond the triplet. We provide extensive ablations to analyze the design
choices on our new CoVR benchmark. Our experiments also demonstrate that
training a CoVR model on our datasets effectively transfers to CoIR, leading to
improved state-of-the-art performance in the zero-shot setup on the CIRR,
FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly
available at https://imagine.enpc.fr/~ventural/covr.
Authors' comments: Appears in TPAMI 2024 (DOI: 10.1109/TPAMI.2024.3463799). Journal
extension of the AAAI 2024 conference paper arXiv:2308.14746v3. Project page:
https://imagine.enpc.fr/~ventural/covr/
Jian Zhu, Wen Cheng, Yu Cui, Chang Tang, Yuyang Dai, Yong Li, Lingfang Zeng
Hash representation learning of multi-view heterogeneous data is the key to
improving the accuracy of multimedia retrieval. However, existing methods
utilize local similarity and fall short of deeply fusing the multi-view
features, resulting in poor retrieval accuracy. Current methods only use local
similarity to train their model. These methods ignore global similarity.
Furthermore, most recent works fuse the multi-view features via a weighted sum
or concatenation. We contend that these fusion methods are insufficient for
capturing the interaction between various views. We present a novel Central
Similarity Multi-View Hashing (CSMVH) method to address the mentioned problems.
Central similarity learning is used for solving the local similarity problem,
which can utilize the global similarity between the hash center and samples. We
present copious empirical data demonstrating the superiority of gate-based
fusion over conventional approaches. On MS COCO and NUS-WIDE, the proposed
CSMVH performs better than the state-of-the-art methods by a large margin (up
to 11.41% mean Average Precision (mAP) improvement).
Authors' comments: accepted by the Asia Pacific Web (APWeb) and Web-Age Information
Management (WAIM) Joint International Conference on Web and Big Data
(APWeb-WAIM2023)
Jia Li, Yongmin Li, Ge Li, Xing Hu, Xin Xia, Zhi Jin
Existing studies show that code summaries help developers understand and
maintain source code. Unfortunately, these summaries are often missing or
outdated in software projects. Code summarization aims to generate natural
language descriptions automatically for source code. Code summaries are highly
structured and have repetitive patterns. Besides the patternized words, a code
summary also contains important keywords, which are the key to reflecting the
functionality of the code. However, the state-of-the-art approaches perform
poorly on predicting the keywords, which leads to the generated summaries
suffering a loss in informativeness. To alleviate this problem, this paper
proposes a novel retrieve-and-edit approach named EditSum for code
summarization. Specifically, EditSum first retrieves a similar code snippet
from a pre-defined corpus and treats its summary as a prototype summary to
learn the pattern. Then, EditSum edits the prototype automatically to combine
the pattern in the prototype with the semantic information of input code. Our
motivation is that the retrieved prototype provides a good start-point for
post-generation because the summaries of similar code snippets often have the
same pattern. The post-editing process further reuses the patternized words in
the prototype and generates keywords based on the semantic information of input
code. We conduct experiments on a large-scale Java corpus and experimental
results demonstrate that EditSum outperforms the state-of-the-art approaches by
a substantial margin. The human evaluation also proves the summaries generated
by EditSum are more informative and useful. We also verify that EditSum
performs well on predicting the patternized words and keywords.
Authors' comments: Accepted by the 36th IEEE/ACM International Conference on Automated
Software Engineering (ASE 2021)
Hongsong Wang, Yuqi Zhang
Patent retrieval has been attracting tremendous interest from researchers in intellectual property and information retrieval communities in the past decades. However, most existing approaches rely on textual and metadata information of the patent, and content-based image-based patent retrieval is rarely investigated. Based on traits of patent drawing images, we present a simple and lightweight model for this task. Without bells and whistles, this approach significantly outperforms other counterparts on a large-scale benchmark and noticeably improves the state-of-the-art by 33.5% with the mean average precision (mAP) score. Further experiments reveal that this model can be elaborately scaled up to achieve a surprisingly high mAP of 93.5%. Our method ranks first in the ECCV 2022 Patent Diagram Image Retrieval Challenge.
Yuan Yuan, Yang Zhan, Zhitong Xiong
Vision-and-language pre-training (VLP) models have experienced a surge in popularity recently. By fine-tuning them on specific datasets, significant performance improvements have been observed in various tasks. However, full fine-tuning of VLP models not only consumes a significant amount of computational resources but also has a significant environmental impact. Moreover, as remote sensing (RS) data is constantly being updated, full fine-tuning may not be practical for real-world applications. To address this issue, in this work, we investigate the parameter-efficient transfer learning (PETL) method to effectively and efficiently transfer visual-language knowledge from the natural domain to the RS domain on the image-text retrieval task. To this end, we make the following contributions. 1) We construct a novel and sophisticated PETL framework for the RS image-text retrieval (RSITR) task, which includes the pretrained CLIP model, a multimodal remote sensing adapter, and a hybrid multi-modal contrastive (HMMC) learning objective; 2) To deal with the problem of high intra-modal similarity in RS data, we design a simple yet effective HMMC loss; 3) We provide comprehensive empirical studies for PETL-based RS image-text retrieval. Our results demonstrate that the proposed method is promising and of great potential for practical applications. 4) We benchmark extensive state-of-the-art PETL methods on the RSITR task. Our proposed model only contains 0.16M training parameters, which can achieve a parameter reduction of 98.9% compared to full fine-tuning, resulting in substantial savings in training costs. Our retrieval performance exceeds traditional methods by 7-13% and achieves comparable or better performance than full fine-tuning. This work can provide new ideas and useful insights for RS vision-language tasks.
Hassan ZivariFard, Remi A. Chou
Consider Private Information Retrieval (PIR), where a client wants to retrieve one file out of $K$ files that are replicated in $N$ different servers and the client selection must remain private when up to $T$ servers may collude. Additionally, suppose that the client has noisy side information about each of the $K$ files, and the side information about a specific file is obtained by passing this file through one of $D$ possible discrete memoryless test channels, where $D\le K$. While the statistics of the test channels are known by the client and by all the servers, the specific mapping $\boldsymbol{\mathcal{M}}$ between the files and the test channels is unknown to the servers. We study this problem under two different privacy metrics. Under the first privacy metric, the client wants to preserve the privacy of its desired file selection and the mapping $\boldsymbol{\mathcal{M}}$. Under the second privacy metric, the client wants to preserve the privacy of its desired file and the mapping $\boldsymbol{\mathcal{M}}$ but is willing to reveal the index of the test channel that is associated to its desired file. For both of these two privacy metrics, we derive the optimal normalized download cost. Our problem setup generalizes PIR with colluding servers, PIR with private noiseless side information, and PIR with private side information under storage constraints.
Alptug Aytekin, Mohamed Nomeir, Sajani Vithana, Sennur Ulukus
We consider both the classical and quantum variations of $X$-secure, $E$-eavesdropped and $T$-colluding symmetric private information retrieval (SPIR). This is the first work to study SPIR with $X$-security in classical or quantum variations. We first develop a scheme for classical $X$-secure, $E$-eavesdropped and $T$-colluding SPIR (XSETSPIR) based on a modified version of cross subspace alignment (CSA), which achieves a rate of $R= 1 - \frac{X+\max(T,E)}{N}$. The modified scheme achieves the same rate as the scheme used for $X$-secure PIR with the extra benefit of symmetric privacy. Next, we extend this scheme to its quantum counterpart based on the $N$-sum box abstraction. This is the first work to consider the presence of eavesdroppers in quantum private information retrieval (QPIR). In the quantum variation, the eavesdroppers have better access to information over the quantum channel compared to the classical channel due to the over-the-air decodability. To that end, we develop another scheme specialized to combat eavesdroppers over quantum channels. The scheme proposed for $X$-secure, $E$-eavesdropped and $T$-colluding quantum SPIR (XSETQSPIR) in this work maintains the super-dense coding gain from the shared entanglement between the databases, i.e., achieves a rate of $R_Q = \min\left\{ 1, 2\left(1-\frac{X+\max(T,E)}{N}\right)\right\}$.
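The two rate expressions in the abstract above can be evaluated directly. The sketch below does only that arithmetic; the function names and the example parameter values are hypothetical, and it assumes $N > X + \max(T,E)$ so that the classical rate is positive.

```python
from fractions import Fraction

def classical_rate(N: int, X: int, T: int, E: int) -> Fraction:
    """R = 1 - (X + max(T, E)) / N, as stated in the abstract."""
    return 1 - Fraction(X + max(T, E), N)

def quantum_rate(N: int, X: int, T: int, E: int) -> Fraction:
    """R_Q = min{1, 2 * (1 - (X + max(T, E)) / N)}, as stated in the abstract."""
    return min(Fraction(1), 2 * classical_rate(N, X, T, E))

# Hypothetical parameters: N = 10 databases, X = 2, T = 3, E = 4.
r = classical_rate(10, 2, 3, 4)    # 1 - 6/10 = 2/5
r_q = quantum_rate(10, 2, 3, 4)    # min(1, 4/5) = 4/5
print(r, r_q)
```

With these example values the quantum rate is twice the classical one, which is the super-dense coding gain mentioned in the abstract; once the classical rate reaches 1/2, the quantum rate saturates at 1.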
Anwesan Pal, Sahil Wadhwa, Ayush Jaiswal, Xu Zhang, Yue Wu, Rakesh Chada, Pradeep Natarajan, Henrik I. Christensen
Multi-turn textual feedback-based fashion image retrieval focuses on a
real-world setting, where users can iteratively provide information to refine
retrieval results until they find an item that fits all their requirements. In
this work, we present a novel memory-based method, called FashionNTM, for such
a multi-turn system. Our framework incorporates a new Cascaded Memory Neural
Turing Machine (CM-NTM) approach for implicit state management, thereby
learning to integrate information across all past turns to retrieve new images,
for a given turn. Unlike vanilla Neural Turing Machine (NTM), our CM-NTM
operates on multiple inputs, which interact with their respective memories via
individual read and write heads, to learn complex relationships. Extensive
evaluation results show that our proposed method outperforms the previous
state-of-the-art algorithm by 50.5%, on Multi-turn FashionIQ -- the only
existing multi-turn fashion dataset currently, in addition to having a relative
improvement of 12.6% on Multi-turn Shoes -- an extension of the single-turn
Shoes dataset that we created in this work. Further analysis of the model in a
real-world interactive setting demonstrates two important capabilities of our
model -- memory retention across turns, and agnosticity to turn order for
non-contradictory feedback. Finally, user study results show that images
retrieved by FashionNTM were favored by 83.1% over other multi-turn models.
Project page: https://sites.google.com/eng.ucsd.edu/fashionntm
Authors' comments: Paper accepted at ICCV-2023
Kaiqu Liang, Samuel Albanie
To date, the majority of video retrieval systems have been optimized for a
"single-shot" scenario in which the user submits a query in isolation, ignoring
previous interactions with the system. Recently, there has been renewed
interest in interactive systems to enhance retrieval, but existing approaches
are complex and deliver limited gains in performance. In this work, we revisit
this topic and propose several simple yet effective baselines for interactive
video retrieval via question-answering. We employ a VideoQA model to simulate
user interactions and show that this enables the productive study of the
interactive retrieval task without access to ground truth dialogue data.
Experiments on MSR-VTT, MSVD, and AVSD show that our framework using
question-based interaction significantly improves the performance of text-based
video retrieval systems.
Authors' comments: ICCV 2023, project page:
https://github.com/kevinliang888/IVR-QA-baselines
Kaihang Pan, Juncheng Li, Hongye Song, Hao Fei, Wei Ji, Shuo Zhang, Jun Lin, Xiaozhong Liu et al.
Recent studies have shown that dense retrieval models, lacking dedicated training data, struggle to perform well across diverse retrieval tasks, as different retrieval tasks often entail distinct search intents. To address this challenge, in this work we introduce ControlRetriever, a generic and efficient approach with a parameter-isolated architecture, capable of controlling dense retrieval models to directly perform varied retrieval tasks, harnessing the power of instructions that explicitly describe retrieval intents in natural language. Leveraging the foundation of ControlNet, which has proven powerful in text-to-image generation, ControlRetriever imbues different retrieval models with the new capacity of controllable retrieval, all while being guided by task-specific instructions. Furthermore, we propose a novel LLM-guided Instruction Synthesizing and Iterative Training strategy, which iteratively tunes ControlRetriever based on extensive automatically-generated retrieval data with diverse instructions by capitalizing on the advancement of large language models. Extensive experiments show that in the BEIR benchmark, with only natural language descriptions of specific retrieval intent for each task, ControlRetriever, as a unified multi-task retrieval system without task-specific tuning, significantly outperforms baseline methods designed with task-specific retrievers and also achieves state-of-the-art zero-shot performance.
Ziyang Yuan, Haoxing Yang, Ningyi Leng, Hongxia Wang
Fourier phase retrieval (PR) is a severely ill-posed inverse problem that arises in various applications. To guarantee a unique solution and relieve the dependence on the initialization, background information can be exploited as a structural prior. However, the requirement for the background information may be challenging when moving to high-resolution imaging. At the same time, the previously proposed projected gradient descent (PGD) method also demands much background information. In this paper, we present an improved theoretical result about the demand for the background information, along with two Douglas-Rachford (DR) based methods. Analytically, we demonstrate that the background required to ensure a unique solution can be decreased by nearly $1/2$ for the 2-D signals compared to the 1-D signals. By generalizing the results into $d$ dimensions, we show that background information of length more than $(2^{\frac{d+1}{d}}-1)$ times that of the signal is sufficient to ensure uniqueness. At the same time, we also analyze the stability and robustness of the model when measurements and background information are corrupted by noise. Furthermore, two methods called Background Douglas-Rachford (BDR) and Convex Background Douglas-Rachford (CBDR) are proposed. BDR, a non-convex method, is proven to have a local R-linear convergence rate under mild assumptions. In contrast, the CBDR method uses convexification techniques and can be proven to have a global convergence guarantee as long as the background information is sufficient. To support this, a new property called F-RIP is established. We test the performance of the proposed methods through simulations as well as real experimental measurements, and demonstrate that they achieve a higher recovery rate with less background information compared to the PGD method.
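As a quick numerical illustration of the sufficiency bound quoted in the abstract above, the sketch below evaluates the factor $2^{\frac{d+1}{d}}-1$ for a few dimensions. It only evaluates the stated formula; the function name `background_factor` is hypothetical, and how background length is measured relative to the signal is left as defined in the paper.

```python
def background_factor(d: int) -> float:
    """Evaluate 2^((d+1)/d) - 1, the factor quoted in the abstract, for dimension d."""
    if d < 1:
        raise ValueError("dimension d must be a positive integer")
    return 2 ** ((d + 1) / d) - 1

# Evaluate the factor for 1-D, 2-D, and 3-D signals.
factors = {d: background_factor(d) for d in (1, 2, 3)}
for d, f in factors.items():
    print(f"d = {d}: factor = {f:.4f}")
```

For $d = 1$ the factor is $2^2 - 1 = 3$, and for $d = 2$ it is $2\sqrt{2} - 1 \approx 1.83$; the factor decreases toward 1 as $d$ grows, since the exponent $\frac{d+1}{d}$ tends to 1.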
Aishwarya Venkataramanan, Martin Laviale, Cédric Pradalier
Most of the research in content-based image retrieval (CBIR) focuses on
developing robust feature representations that can effectively retrieve
instances from a database of images that are visually similar to a query.
However, the retrieved images sometimes contain results that are not
semantically related to the query. To address this, we propose a method for
CBIR that captures both visual and semantic similarity using a visual
hierarchy. The hierarchy is constructed by merging classes with overlapping
features in the latent space of a deep neural network trained for
classification, assuming that overlapping classes share high visual and
semantic similarities. Finally, the constructed hierarchy is integrated into
the distance calculation metric for similarity search. Experiments on standard
datasets: CUB-200-2011 and CIFAR100, and a real-life use case using diatom
microscopy images show that our method achieves superior performance compared
to the existing methods on image retrieval.
Authors' comments: Accepted in ICVS 2023
Siqi Song, Qi Lv, Lei Geng, Ziqiang Cao, Guohong Fu
Chinese Spelling Check (CSC) refers to the detection and correction of spelling errors in Chinese texts. In practical application scenarios, it is important to make CSC models have the ability to correct errors across different domains. In this paper, we propose a retrieval-augmented spelling check framework called RSpell, which searches corresponding domain terms and incorporates them into CSC models. Specifically, we employ pinyin fuzzy matching to search for terms, which are combined with the input and fed into the CSC model. Then, we introduce an adaptive process control mechanism to dynamically adjust the impact of external knowledge on the model. Additionally, we develop an iterative strategy for the RSpell framework to enhance reasoning capabilities. We conducted experiments on CSC datasets in three domains: law, medicine, and official document writing. The results demonstrate that RSpell achieves state-of-the-art performance in both zero-shot and fine-tuning scenarios, demonstrating the effectiveness of the retrieval-augmented CSC framework. Our code is available at https://github.com/47777777/Rspell.
Chaorui Deng, Qi Chen, Pengda Qin, Da Chen, Qi Wu
In text-video retrieval, recent works have benefited from the powerful
learning capabilities of pre-trained text-image foundation models (e.g., CLIP)
by adapting them to the video domain. A critical problem for them is how to
effectively capture the rich semantics inside the video using the image encoder
of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal
modeling techniques to fuse the text information into video frame
representations, which, however, incurs severe efficiency issues in large-scale
retrieval systems as the video representations must be recomputed online for
every text query. In this paper, we discard this problematic cross-modal fusion
process and aim to learn semantically-enhanced representations purely from the
video, so that the video representations can be computed offline and reused for
different texts. Concretely, we first introduce a spatial-temporal "Prompt
Cube" into the CLIP image encoder and iteratively switch it within the encoder
layers to efficiently incorporate the global video semantics into frame
representations. We then propose to apply an auxiliary video captioning
objective to train the frame representations, which facilitates the learning of
detailed video semantics by providing fine-grained guidance in the semantic
space. With a naive temporal fusion strategy (i.e., mean-pooling) on the
enhanced frame representations, we obtain state-of-the-art performances on
three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
Authors' comments: to appear in ICCV 2023
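The offline mean-pooling strategy mentioned in the abstract above can be sketched as follows. This is only an illustration of the general idea: the frame and text features are toy arrays rather than real CLIP outputs, ranking by cosine similarity is an assumption (the abstract does not specify the scoring function), and the names `video_embedding` and `rank_videos` are hypothetical.

```python
import numpy as np

def video_embedding(frame_features: np.ndarray) -> np.ndarray:
    """Mean-pool per-frame features (num_frames, dim) into one unit-norm vector."""
    pooled = frame_features.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def rank_videos(text_feature: np.ndarray, video_embeddings: np.ndarray) -> np.ndarray:
    """Rank precomputed video embeddings by cosine similarity to a text feature."""
    text = text_feature / np.linalg.norm(text_feature)
    scores = video_embeddings @ text
    # Negate the scores so the most similar video comes first.
    return np.argsort(-scores, kind="stable")

# Toy features: two videos with two 3-dimensional frames each.
video_a = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
video_b = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])

# Video embeddings are computed once, offline, and reused for every text query.
video_embeddings = np.stack([video_embedding(video_a), video_embedding(video_b)])

text_feature = np.array([0.9, 0.1, 0.0])
ranking = rank_videos(text_feature, video_embeddings)
print(ranking)
```

Because the video side never sees the text, adding a new query costs one text encoding and a matrix-vector product, which is the efficiency argument the abstract makes against online cross-modal fusion.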
Junyang Chen, Hanjiang Lai
Text-guided image retrieval incorporates conditional text to better capture users' intent. Traditionally, the existing methods focus on minimizing the embedding distances between the source inputs and the targeted image, using the provided triplets $\langle$source image, source text, target image$\rangle$. However, such triplet optimization may limit the ability of the learned retrieval model to capture more detailed ranking information, e.g., the triplets are one-to-one correspondences and they fail to account for many-to-many correspondences arising from semantic diversity in feedback languages and images. To capture more ranking information, we propose a novel ranking-aware uncertainty approach to model many-to-many correspondences by only using the provided triplets. We introduce uncertainty learning to learn the stochastic ranking list of features. Specifically, our approach mainly comprises three components: (1) In-sample uncertainty, which aims to capture semantic diversity using a Gaussian distribution derived from both combined and target features; (2) Cross-sample uncertainty, which further mines the ranking information from other samples' distributions; and (3) Distribution regularization, which aligns the distributional representations of source inputs and targeted image. Compared to the existing state-of-the-art methods, our proposed method achieves significant results on two public datasets for composed image retrieval.
Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zhicheng Dou et al.
As a primary means of information acquisition, information retrieval (IR)
systems, such as search engines, have integrated themselves into our daily
lives. These systems also serve as components of dialogue, question-answering,
and recommender systems. The trajectory of IR has evolved dynamically from its
origins in term-based methods to its integration with advanced neural models.
While the neural models excel at capturing complex contextual signals and
semantic nuances, thereby reshaping the IR landscape, they still face
challenges such as data scarcity, interpretability, and the generation of
contextually plausible yet potentially inaccurate responses. This evolution
requires a combination of both traditional methods (such as term-based sparse
retrieval methods with rapid response) and modern neural architectures (such as
language models with powerful language understanding capacity). Meanwhile, the
emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has
revolutionized natural language processing due to their remarkable language
understanding, generation, generalization, and reasoning abilities.
Consequently, recent research has sought to leverage LLMs to improve IR
systems. Given the rapid evolution of this research trajectory, it is necessary
to consolidate existing methodologies and provide nuanced insights through a
comprehensive overview. In this survey, we delve into the confluence of LLMs
and IR systems, including crucial aspects such as query rewriters, retrievers,
rerankers, and readers. Additionally, we explore promising directions, such as
search agents, within this expanding field.
Authors' comments: updated to version 2