Hyunjin Choi, Hyunjae Lee, Seongho Joe, Youngjune L. Gwon
Encoded representations from a pretrained deep learning model (e.g., BERT text embeddings, penultimate CNN layer activations of an image) convey a rich set of features beneficial for information retrieval. Embeddings for a particular modality of data occupy a high-dimensional space of their own, but this space can be semantically aligned to another by a simple mapping without training a deep neural net. In this paper, we take a simple mapping, computed from least squares and singular value decomposition (SVD) as a solution to the Procrustes problem, to serve as a means of cross-modal information retrieval. That is, given information in one modality such as text, the mapping helps us locate a semantically equivalent data item in another modality such as image. Using off-the-shelf pretrained deep learning models, we have experimented with the aforementioned simple cross-modal mappings in tasks of text-to-image and image-to-text retrieval. Despite their simplicity, our mappings perform reasonably well, reaching the highest accuracy of 77% on recall@10, which is comparable to that of methods requiring costly neural net training and fine-tuning. We have improved the simple mappings by contrastive learning on the pretrained models. Contrastive learning can be thought of as properly biasing the pretrained encoders to enhance the cross-modal mapping quality. We have further improved the performance with a multilayer perceptron with gating (gMLP), a simple neural architecture.
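As a rough illustration of the kind of mapping this abstract describes, the sketch below computes an orthogonal Procrustes solution with an SVD and uses it for nearest-neighbour retrieval. It is not the authors' code: the function names `procrustes_map` and `recall_at_k`, the synthetic "text" and "image" embeddings, and the assumption that both modalities share the same dimension are all illustrative choices.

```python
import numpy as np

def procrustes_map(X, Y):
    """Orthogonal W minimizing ||X W - Y||_F (orthogonal Procrustes).

    X and Y hold paired embeddings row by row and are assumed to have
    the same number of columns.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose true match (same row index) is in the top k."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                              # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k best matches
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())

# Synthetic stand-ins for paired text/image embeddings (assumption):
# the "image" space is a rotated, slightly noisy copy of the "text" space.
rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=(200, d))
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
image = text @ R_true + 0.01 * rng.normal(size=(200, d))

W = procrustes_map(text, image)
print("recall@1:", recall_at_k(text @ W, image, k=1))
```

When the two embedding spaces differ in dimension, an unconstrained least-squares fit (for example with `np.linalg.lstsq`) is the natural alternative to the orthogonal solution shown here.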
Giovanni Pellegrini, Jacopo Bertolotti
Phase-retrieval techniques aim to recover the original signal from just the
modulus of its Fourier transform, which is usually much easier to measure than
its phase, but the standard iterative techniques tend to fail if only part of
the modulus information is available. We show that a neural network can be
trained to perform phase retrieval using only incomplete information, and we
discuss advantages and limitations of this approach.
Authors' comments: Submission to SciPost
Maximilian Du, Suraj Nair, Dorsa Sadigh, Chelsea Finn
Enabling robots to learn novel visuomotor skills in a data-efficient manner remains an unsolved problem with myriad challenges. A popular paradigm for tackling this problem is through leveraging large unlabeled datasets that have many behaviors in them and then adapting a policy to a specific task using a small amount of task-specific human supervision (i.e. interventions or demonstrations). However, how best to leverage the narrow task-specific supervision and balance it with offline data remains an open question. Our key insight in this work is that task-specific data not only provides new data for an agent to train on but can also inform the type of prior data the agent should use for learning. Concretely, we propose a simple approach that uses a small amount of downstream expert data to selectively query relevant behaviors from an offline, unlabeled dataset (including many sub-optimal behaviors). The agent is then jointly trained on the expert and queried data. We observe that our method learns to query only the relevant transitions to the task, filtering out sub-optimal or task-irrelevant data. By doing so, it is able to learn more effectively from the mix of task-specific and offline data compared to naively mixing the data or only using the task-specific data. Furthermore, we find that our simple querying approach outperforms more complex goal-conditioned methods by 20% across simulated and real robotic manipulation tasks from images. See https://sites.google.com/view/behaviorretrieval for videos and code.
Tobias Fink, Gabor Recski, Wojciech Kusa, Allan Hanbury
We discuss our experiments for COLIEE Task 1, a court case retrieval
competition using cases from the Federal Court of Canada. During experiments on
the training data we observe that passage level retrieval with rank fusion
outperforms document level retrieval. By explicitly adding extracted statute
information to the queries and documents we can further improve the results. We
submit two passage level runs to the competition, which achieve high recall but
low precision.
Authors' comments: Sixteenth International Workshop on Juris-informatics (JURISIN). 2022
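The abstract above does not say which rank fusion method was used. As one common possibility, the sketch below applies reciprocal rank fusion (RRF) to per-passage rankings of candidate cases. The function name, the constant `k=60`, and the case identifiers are assumptions made for illustration only.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    k=60 is a conventional default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# One ranked list per passage of the query case (hypothetical ids).
passage_rankings = [
    ["case_A", "case_B", "case_C"],
    ["case_B", "case_A", "case_D"],
    ["case_B", "case_C", "case_A"],
]
fused = reciprocal_rank_fusion(passage_rankings)
print(fused)
```

A case that ranks well for several passages accumulates score from each list, which is how passage-level evidence is turned into a single document-level ranking in this sketch.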
Shengyao Zhuang, Linjun Shou, Jian Pei, Ming Gong, Houxing Ren, Guido Zuccon, Daxin Jiang
Current dense retrievers (DRs) are limited in their ability to effectively
process misspelled queries, which constitute a significant portion of query
traffic in commercial search engines. The main issue is that the pre-trained
language model-based encoders used by DRs are typically trained and fine-tuned
using clean, well-curated text data. Misspelled queries are typically not found
in the data used for training these models, and thus misspelled queries
observed at inference time are out-of-distribution compared to the data used
for training and fine-tuning. Previous efforts to address this issue have
focused on fine-tuning strategies, but their effectiveness on
misspelled queries remains lower than that of pipelines that employ separate
state-of-the-art spell-checking components. To address this challenge, we
propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse
Retrieval), a novel pre-training strategy for DRs that increases their
robustness to misspelled queries while preserving their effectiveness in
downstream retrieval tasks. ToRoDer utilizes an encoder-decoder architecture
where the encoder takes misspelled text with masked tokens as input and outputs
bottlenecked information to the decoder. The decoder then takes as input the
bottlenecked embeddings, along with token embeddings of the original text with
the misspelled tokens masked out. The pre-training task is to recover the
masked tokens for both the encoder and decoder. Our extensive experimental
results and detailed ablation studies show that DRs pre-trained with ToRoDer
exhibit significantly higher effectiveness on misspelled queries, considerably
closing the gap with pipelines that use a separate, complex spell-checker
component, while retaining their effectiveness on correctly spelled queries.
Authors' comments: 10 pages, accepted at SIGIR-AP
Fuxiang Huang, Lei Zhang
Interactive Image Retrieval (IIR) aims to retrieve images that are generally
similar to the reference image but under the requested text modification. The
existing methods usually concatenate or sum the features of image and text
simply and roughly, which, however, makes it difficult to precisely change the local
semantics of the image that the text intends to modify. To solve this problem,
we propose a Language Guided Local Infiltration (LGLI) system, which fully
utilizes the text information and penetrates text features into image features
as much as possible. Specifically, we first propose a Language Prompt Visual
Localization (LPVL) module to generate a localization mask which explicitly
locates the region (semantics) intended to be modified. Then we introduce a
Text Infiltration with Local Awareness (TILA) module, which is deployed in the
network to precisely modify the reference image and generate image-text
infiltrated representation. Extensive experiments on various benchmark
databases validate that our method outperforms most state-of-the-art IIR
approaches.
Authors' comments: 10 pages, 9 figures, 4 tables, IEEE/CVF Conference on Computer Vision
and Pattern Recognition
Hermann Kroll, Christin Katharina Kreutz, Pascal Sackhoff, Wolf-Tilo Balke
Providing effective access paths to content is a key task in digital
libraries. Oftentimes, such access paths are realized through advanced query
languages, which, on the one hand, users may find challenging to learn or use,
and, on the other, require libraries to convert their content into a
high-quality structured representation. As a remedy, narrative information access
proposes to query library content through structured patterns directly, to
ensure validity and coherence of information. However, users still find it
challenging to express their information needs in such patterns. Therefore,
this work bridges the gap by introducing a method that deduces patterns from
keyword searches. Moreover, our user studies with participants from the
biomedical domain show their acceptance of our prototypical system.
Authors' comments: Accepted at JCDL2023. 12 pages. Full Research Paper
Jian Zhu, Zhangmin Huang, Xiaohu Ruan, Yu Cui, Yongli Cheng, Lingfang Zeng
Learning the hash representation of multi-view heterogeneous data is an
important task in multimedia retrieval. However, existing methods fail to
effectively fuse the multi-view features and utilize the metric information
provided by the dissimilar samples, leading to limited retrieval precision.
Current methods utilize weighted sum or concatenation to fuse the multi-view
features. We argue that these fusion methods cannot capture the interaction
among different views. Furthermore, these methods ignore the information
provided by the dissimilar samples. We propose a novel deep metric multi-view
hashing (DMMVH) method to address the mentioned problems. Extensive empirical
evidence is presented to show that gate-based fusion is better than typical
methods. We introduce deep metric learning to the multi-view hashing problems,
which can utilize metric information of dissimilar samples. On the
MIR-Flickr25K, MS COCO, and NUS-WIDE, our method outperforms the current
state-of-the-art methods by a large margin (up to 15.28 mean Average Precision
(mAP) improvement).
Authors' comments: Accepted by IEEE ICME 2023
Xiang An, Jiankang Deng, Kaicheng Yang, Jaiwei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu
Modern image retrieval methods typically rely on fine-tuning pre-trained
encoders to extract image-level descriptors. However, the most widely used
models are pre-trained on ImageNet-1K with limited classes. The pre-trained
feature representation is therefore not universal enough to generalize well to
the diverse open-world classes. In this paper, we first cluster the large-scale
LAION400M into one million pseudo classes based on the joint textual and visual
features extracted by the CLIP model. Due to the confusion of label
granularity, the automatically clustered dataset inevitably contains heavy
inter-class conflict. To alleviate such conflict, we randomly select partial
inter-class prototypes to construct the margin-based softmax loss. To further
enhance the low-dimensional feature representation, we randomly select partial
feature dimensions when calculating the similarities between embeddings and
class-wise prototypes. The dual random partial selections are with respect to
the class dimension and the feature dimension of the prototype matrix, making
the classification conflict-robust and the feature embedding compact. Our
method significantly outperforms state-of-the-art unsupervised and supervised
image retrieval approaches on multiple benchmarks. The code and pre-trained
models are released to facilitate future research at
https://github.com/deepglint/unicom.
Authors' comments: Accepted at ICLR2023
Yang Chen, Yanan Wang
The conjugate phase retrieval problem concerns the determination of a complex-valued function, up to a unimodular constant and conjugation, from its magnitude observations. It can also be considered as a conjugate phaseless sampling and reconstruction problem in an infinite dimensional space. In this paper, we first characterize the conjugate phase retrieval from the point evaluations in a shift-invariant space $\mathcal S(\phi)$, where the generator $\phi$ is a compactly supported real-valued function. If the generator $\phi$ has some spanning property, we also show that a conjugate phase retrievable function in $\mathcal S(\phi)$ can be reconstructed from its phaseless samples taken on a discrete set with finite sampling density. With additional phaseless measurements on the function derivative, for the B-spline generator $B_N$ of order $N\ge 3$ which does not have the spanning property, we find sets $\Gamma$ and $\Gamma'\subset (0,1)$ of cardinalities $2N-1$ and $2N-5$ respectively, such that a conjugate phase retrievable function $f$ in the spline space $\mathcal B_N$ can be determined from its phaseless Hermite samples $|f(\gamma)|, \gamma\in\Gamma+\mathbb{Z}$, and $|f'(\gamma')|, \gamma'\in\Gamma'+\mathbb{Z}$. An algorithm is proposed for the conjugate phase retrieval of piecewise polynomials from the Hermite samples. Our results provide illustrative examples of real conjugate phase retrievable frames for the complex finite dimensional space $\mathbb{C}^N$.
Trung-Nghia Le, Tam V. Nguyen, Minh-Quan Le, Trong-Thuan Nguyen, Viet-Tham Huynh, Trong-Le Do, Khanh-Duy Le, Mai-Khiem Tran et al.
3D object retrieval is an important yet challenging task, which has drawn
more and more attention in recent years. While existing approaches have made
strides in addressing this issue, they are often limited to restricted settings
such as image and sketch queries, which are often unfriendly interactions for
common users. In order to overcome these limitations, this paper presents a
novel SHREC challenge track focusing on text-based fine-grained retrieval of 3D
animal models. Unlike previous SHREC challenge tracks, the proposed task is
considerably more challenging, requiring participants to develop innovative
approaches to tackle the problem of text-based retrieval. Despite the increased
difficulty, we believe that this task has the potential to drive useful
applications in practice and facilitate more intuitive interactions with 3D
objects. Five groups participated in our competition, submitting a total of 114
runs. While the results obtained in our competition are satisfactory, we note
that the challenges presented by this task are far from being fully solved. As
such, we provide insights into potential areas for future research and
improvements. We believe that we can help push the boundaries of 3D object
retrieval and facilitate more user-friendly interactions via vision-language
technologies.
Authors' comments: arXiv admin note: text overlap with arXiv:2304.05731
Trung-Nghia Le, Tam V. Nguyen, Minh-Quan Le, Trong-Thuan Nguyen, Viet-Tham Huynh, Trong-Le Do, Khanh-Duy Le, Mai-Khiem Tran et al.
The retrieval of 3D objects has gained significant importance in recent years due to its broad range of applications in computer vision, computer graphics, virtual reality, and augmented reality. However, the retrieval of 3D objects presents significant challenges due to the intricate nature of 3D models, which can vary in shape, size, and texture, and have numerous polygons and vertices. To this end, we introduce a novel SHREC challenge track that focuses on retrieving relevant 3D animal models from a dataset using sketch queries and expedites accessing 3D models through available sketches. Furthermore, a new dataset named ANIMAR was constructed in this study, comprising a collection of 711 unique 3D animal models and 140 corresponding sketch queries. Our contest requires participants to retrieve 3D models based on complex and detailed sketches. We receive satisfactory results from eight teams and 204 runs. Although further improvement is necessary, the proposed task has the potential to incentivize additional research in the domain of 3D object retrieval, potentially yielding benefits for a wide range of applications. We also provide insights into potential areas of future research, such as improving techniques for feature extraction and matching, and creating more diverse datasets to evaluate retrieval performance.
Andrea Bianchi, Giordano d'Aloisio, Francesca Marzi, Antinisca Di Marco
Reproducibility is a crucial aspect of scientific research that involves the ability to independently replicate experimental results by analysing the same data or repeating the same experiment. Over the years, many works have been proposed to make the results of experiments actually reproducible. However, very few address the importance of data reproducibility, defined as the ability of independent researchers to retain the same dataset used as input for experimentation. Properly addressing the problem of data reproducibility is crucial because often just providing a link to the data is not enough to make the results reproducible. In fact, proper metadata (e.g., preprocessing instructions) must also be provided to make a dataset fully reproducible. In this work, our aim is to fill this gap by proposing a decision tree to shepherd researchers through the reproducibility of their datasets. In particular, this decision tree guides researchers through identifying whether the dataset is actually reproducible and whether additional metadata (i.e., additional resources needed to reproduce the data) must also be provided. This decision tree will be the foundation of a future application that will automate the data reproduction process by automatically providing the necessary metadata based on the particular context (e.g., data availability, data preprocessing, and so on). It is worth noting that, in this paper, we detail the steps to make a dataset retrievable, while we will detail other crucial aspects of reproducibility (e.g., dataset documentation) in future works.
Qiao Jin, Andrew Shin, Zhiyong Lu
Queries with similar information needs tend to have similar document clicks,
especially in biomedical literature search engines where queries are generally
short and top documents account for most of the total clicks. Motivated by
this, we present a novel architecture for biomedical literature search, namely
Log-Augmented DEnse Retrieval (LADER), which is a simple plug-in module that
augments a dense retriever with the click logs retrieved from similar training
queries. Specifically, LADER finds both similar documents and queries to the
given query by a dense retriever. Then, LADER scores relevant (clicked)
documents of similar queries weighted by their similarity to the input query.
The final document scores by LADER are the average of (1) the document
similarity scores from the dense retriever and (2) the aggregated document
scores from the click logs of similar queries. Despite its simplicity, LADER
achieves new state-of-the-art (SOTA) performance on TripClick, a recently
released benchmark for biomedical literature retrieval. On the frequent (HEAD)
queries, LADER largely outperforms the best retrieval model by 39% relative
NDCG@10 (0.338 vs. 0.243). LADER also achieves better performance on the less
frequent (TORSO) queries with 11% relative NDCG@10 improvement over the
previous SOTA (0.303 vs. 0.272). On the rare (TAIL) queries where similar
queries are scarce, LADER still compares favorably to the previous SOTA method
(NDCG@10: 0.310 vs. 0.295). On all queries, LADER can improve the performance
of a dense retriever by 24%-37% relative NDCG@10 while not requiring additional
training, and further performance improvement is expected from more logs. Our
regression analysis has shown that queries that are more frequent, have higher
entropy of query similarity and lower entropy of document similarity, tend to
benefit more from log augmentation.
Authors' comments: SIGIR 2023
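The scoring rule described in the LADER abstract above can be sketched in a few lines. The following is a minimal sketch and not the authors' implementation: the function name `lader_scores`, the use of a plain sum to aggregate log scores, the absence of any score normalization, and the toy similarity values are all assumptions.

```python
def lader_scores(doc_sims, query_sims, click_log):
    """Combine dense-retriever document scores with click-log evidence.

    doc_sims:   {doc_id: similarity of the document to the input query}
    query_sims: {train_query_id: similarity of that query to the input query}
    click_log:  {train_query_id: iterable of clicked (relevant) doc ids}
    """
    # Aggregate log scores: each clicked document of a similar query
    # receives that query's similarity to the input query (summed here).
    log_scores = {}
    for query_id, sim in query_sims.items():
        for doc_id in click_log.get(query_id, ()):
            log_scores[doc_id] = log_scores.get(doc_id, 0.0) + sim

    # Final score: average of the two sources; a missing score counts as 0.
    all_docs = set(doc_sims) | set(log_scores)
    return {
        doc_id: 0.5 * (doc_sims.get(doc_id, 0.0) + log_scores.get(doc_id, 0.0))
        for doc_id in all_docs
    }

# Toy example with hypothetical ids and similarity values.
doc_sims = {"d1": 0.8, "d2": 0.6}
query_sims = {"q1": 0.9, "q2": 0.5}
click_log = {"q1": ["d2"], "q2": ["d2", "d3"]}

scores = lader_scores(doc_sims, query_sims, click_log)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy case the document clicked for both similar queries overtakes the document that the dense retriever alone ranked first, which is the effect the abstract attributes to log augmentation.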
Yanru Xiao, Cong Wang, Xing Gao
The vulnerability in the algorithm supply chain of deep learning has posed new challenges to downstream image retrieval systems. Among a variety of techniques, deep hashing is gaining popularity. As it inherits the algorithmic backend from deep learning, a handful of attacks have recently been proposed to disrupt normal image retrieval. Unfortunately, the defense strategies in softmax classification are not readily available to be applied in the image retrieval domain. In this paper, we propose an efficient and unsupervised scheme to identify unique adversarial behaviors in the Hamming space. In particular, we design three criteria from the perspectives of Hamming distance, quantization loss and denoising to defend against both untargeted and targeted attacks, which collectively limit the adversarial space. Extensive experiments on four datasets demonstrate 2-23% improvements in detection rates with minimal computational overhead for real-time image queries.
Enrico Ventura
Since the last two decades of the past century, pioneering studies in the
field of statistical physics have focused their efforts on developing models of
neural networks that could display memory storage and retrieval. Though many
associative memory models were easy to handle and still quite effective at
explaining the basic memory retrieval processes in the brain, they were not
satisfactory from the biological point of view. It became clear to scientists
that a biologically realistic neural network should respect typical
features observed in neurophysiology experiments. This aspect has
led to the introduction of Balanced Networks, systems where excitatory and
inhibitory neurons balance their effect on each other as an emergent property
of the network dynamics. One feature of such models is a mean level
of neuronal activity (i.e. the average spiking rate of neurons) that is
uniquely defined by a linear equation in the external input. This aspect
might help to reproduce what is measured in particular areas devoted to memory
storage, i.e. a persistent activity during the memory retrieval performance.
Even though progress on balanced networks has been made in the
last two decades, there is still no complete theory that reconciles memory
retrieval and balance in a network of neurons. The aim of this work is to
develop a biologically plausible model that presents both balance and memory
retrieval, building on a framework of mean field equations that can predict the
theoretical behaviour of the network under the choice of a set of control
parameters. We will thus measure the critical capacity of the system as a
function of these parameters, comparing the theoretical results with the
numerical simulations.
Authors' comments: Master Thesis in Theoretical Physics, 71 pages
Shulin Huang, Shirong Ma, Yangning Li, Yinghui Li, Hai-Tao Zheng
Entity Set Expansion (ESE) is a critical task aiming at expanding entities of
the target semantic class described by seed entities. Most existing ESE methods
are retrieval-based frameworks that need to extract contextual features of
entities and calculate the similarity between seed entities and candidate
entities. To achieve the two purposes, they iteratively traverse the corpus and
the entity vocabulary, resulting in poor efficiency and scalability.
Experimental results indicate that the time consumed by the retrieval-based ESE
methods increases linearly with entity vocabulary and corpus size. In this
paper, we first propose the Generative Entity Set Expansion (GenExpan) framework,
which utilizes a generative pre-trained auto-regressive language model to
accomplish the ESE task. Specifically, a prefix tree is employed to guarantee the
validity of entity generation, and automatically generated class names are
adopted to guide the model to generate target entities. Moreover, we propose
Knowledge Calibration and Generative Ranking to further bridge the gap between
generic knowledge of the language model and the goal of the ESE task. For
efficiency, expansion time consumed by GenExpan is independent of entity
vocabulary and corpus size, and GenExpan achieves an average 600% speedup
compared to strong baselines. For expansion effectiveness, our framework
outperforms previous state-of-the-art ESE methods.
Authors' comments: Accepted by CIKM 2024 (FULL paper)
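The GenExpan abstract above mentions a prefix tree that guarantees valid entity generation. The sketch below shows the general idea of trie-constrained decoding under simplifying assumptions: entities are pre-tokenized lists of strings, decoding is greedy, and `toy_score` is a hypothetical stand-in for a language model's next-token score. None of these names come from the paper.

```python
class PrefixTree:
    """Trie over tokenized entity names; END marks a complete entity."""
    END = "<end>"

    def __init__(self, entities):
        self.root = {}
        for tokens in entities:
            node = self.root
            for tok in tokens:
                node = node.setdefault(tok, {})
            node[self.END] = {}

    def allowed_next(self, prefix):
        """Tokens that keep the prefix inside the entity vocabulary."""
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return []
        return sorted(node)

def constrained_greedy_decode(tree, score_fn):
    """Greedy decoding restricted to trie continuations at every step."""
    prefix = []
    while True:
        candidates = tree.allowed_next(prefix)
        if not candidates:          # only happens for an empty tree
            return prefix
        best = max(candidates, key=lambda tok: score_fn(prefix, tok))
        if best == PrefixTree.END:  # a complete, valid entity was produced
            return prefix
        prefix.append(best)

# Hypothetical entity vocabulary and token scores.
entities = [["new", "york"], ["new", "delhi"], ["paris"]]
token_scores = {"new": 0.6, "paris": 0.4, "delhi": 0.7, "york": 0.3, "<end>": 0.1}

def toy_score(prefix, token):
    # Stand-in for a language model score; ignores the prefix entirely.
    return token_scores[token]

tree = PrefixTree(entities)
print(constrained_greedy_decode(tree, toy_score))
```

Because every step only considers children of the current trie node, the decoder can never emit a token sequence outside the entity vocabulary, regardless of what the scoring function prefers.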
Daniel Campos, ChengXiang Zhai, Alessandro Magnani
The success of contextual word representations and advances in neural
information retrieval have made dense vector-based retrieval a standard
approach for passage and document ranking. While effective and efficient,
dual-encoders are brittle to variations in query distributions and noisy
queries. Data augmentation can make models more robust but introduces overhead
to training set generation and requires retraining and index regeneration. We
present Contrastive Alignment POst Training (CAPOT), a highly efficient
finetuning method that improves model robustness without requiring index
regeneration or training set optimization or alteration. CAPOT enables
robust retrieval by freezing the document encoder while the query encoder
learns to align noisy queries with their unaltered root. We evaluate CAPOT
on noisy variants of MSMARCO, Natural Questions, and Trivia QA passage retrieval,
finding CAPOT has a similar impact as data augmentation with none of its
overhead.
Authors' comments: 8 pages, 6 figures, 30 tables
Jae Myung Kim, A. Sophia Koepke, Cordelia Schmid, Zeynep Akata
Cross-modal retrieval methods are the preferred tool to search databases for
the text that best matches a query image and vice versa. However, image-text
retrieval models commonly learn to memorize spurious correlations in the
training data, such as frequent object co-occurrence, instead of looking at the
actual underlying reasons for the prediction in the image. For image-text
retrieval, this manifests in retrieved sentences that mention objects that are
not present in the query image. In this work, we introduce ODmAP@k, an object
decorrelation metric that measures a model's robustness to spurious
correlations in the training data. We use automatic image and text
manipulations to control the presence of such object correlations in designated
test data. Additionally, our data synthesis technique is used to tackle model
biases due to spurious correlations of semantically unrelated objects in the
training data. We apply our proposed pipeline, which involves the finetuning of
image-text retrieval frameworks on carefully designed synthetic data, to three
state-of-the-art models for image-text retrieval. This results in significant
improvements for all three models, both in terms of the standard retrieval
performance and in terms of our object decorrelation metric. The code is
available at https://github.com/ExplainableML/Spurious_CM_Retrieval.
Authors' comments: CVPR'23 MULA Workshop
Fernando Giner
Information retrieval (IR) evaluation measures are cornerstones for
determining the suitability and task performance efficiency of retrieval
systems. Their metric and scale properties make it possible to compare one system against
another to establish differences or similarities. Based on the representational
theory of measurement, this paper determines these properties by exploiting the
information contained in a retrieval measure itself. It establishes the
intrinsic framework of a retrieval measure, which is the common scenario when
the domain set is not explicitly specified. A method to determine the metric
and scale properties of any retrieval measure is provided, requiring knowledge
of only some of its attained values. The method establishes three main
categories of retrieval measures according to their intrinsic properties. Some
common user-oriented and system-oriented evaluation measures are classified
according to the presented taxonomy.
Authors' comments: 23 pages