Dario Cioni, Lorenzo Berlincioni, Federico Becattini, Alberto del Bimbo
Cultural heritage applications and advanced machine learning models are
creating a fruitful synergy to provide effective and accessible ways of
interacting with artworks. Smart audio-guides, personalized art-related content
and gamification approaches are just a few examples of how technology can be
exploited to provide additional value to artists or exhibitions. Nonetheless,
from a machine learning point of view, the amount of available artistic data is
often not enough to train effective models. Off-the-shelf computer vision
modules can still be exploited to some extent, yet a severe domain shift is
present between art images and standard natural image datasets used to train
such models. As a result, this can lead to degraded performance. This paper
introduces a novel approach to address the challenges of limited annotated data
and domain shifts in the cultural heritage domain. By leveraging generative
vision-language models, we augment art datasets by generating diverse
variations of artworks conditioned on their captions. This augmentation
strategy enhances dataset diversity, bridging the gap between natural images
and artworks, and improving the alignment of visual cues with knowledge from
general-purpose datasets. The generated variations assist in training vision
and language models that have a deeper understanding of artistic
characteristics and are able to generate better captions with appropriate jargon.
Authors' comments: Accepted at ICCV 2023 4th Workshop on e-Heritage
Yi Bin, Haoxuan Li, Yahui Xu, Xing Xu, Yang Yang, Heng Tao Shen
Most existing cross-modal retrieval methods employ two-stream encoders with
different architectures for images and texts, \textit{e.g.}, CNN for images and
RNN/Transformer for texts. Such discrepancy in architectures may induce
different semantic distribution spaces and limit the interactions between
images and texts, and further result in inferior alignment between images and
texts. To fill this research gap, inspired by recent advances of Transformers
in vision tasks, we propose to unify the encoder architectures with
Transformers for both modalities. Specifically, we design a cross-modal
retrieval framework purely based on two-stream Transformers, dubbed
\textbf{Hierarchical Alignment Transformers (HAT)}, which consists of an image
Transformer, a text Transformer, and a hierarchical alignment module. With such
identical architectures, the encoders could produce representations with more
similar characteristics for images and texts, and make the interactions and
alignments between them much easier. Besides, to leverage the rich semantics,
we devise a hierarchical alignment scheme to explore multi-level
correspondences of different layers between images and texts. To evaluate the
effectiveness of the proposed HAT, we conduct extensive experiments on two
benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that
HAT outperforms SOTA baselines by a large margin. Specifically, on two key
tasks, \textit{i.e.}, image-to-text and text-to-image retrieval, HAT achieves
7.6\% and 16.7\% relative score improvement of Recall@1 on MSCOCO, and 4.4\%
and 11.6\% on Flickr30K, respectively. The code is available at
\url{https://github.com/LuminosityX/HAT}.
Authors' comments: Accepted at ACM Multimedia 2023
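The Recall@1 figures quoted in the abstract above can be made concrete with a minimal sketch of how Recall@K and a relative improvement are typically computed. This is not the HAT evaluation code: the similarity matrix and the two scores passed to `relative_improvement` are hypothetical values chosen for illustration, and the one-to-one ground truth (query i matches candidate i) is a simplifying assumption, since MSCOCO and Flickr30K pair each image with several captions.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth candidate is ranked in the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the ground truth for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def relative_improvement(new_score, old_score):
    """Relative (not absolute) gain of new_score over old_score."""
    return (new_score - old_score) / old_score


# Hypothetical 3x3 query-by-candidate similarity matrix.
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
recall_1 = recall_at_k(similarity, 1)
recall_2 = recall_at_k(similarity, 2)
# Hypothetical baseline and new Recall@1 values.
gain = relative_improvement(0.55, 0.50)
```

A relative improvement divides the gain by the baseline score, so it is larger than the absolute difference whenever the baseline is below 1.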
Ting Wang, Xiaotong Wu, Jizhou Li, Chao Wang
X-ray microspectroscopic techniques are essential for studying morphological
and chemical changes in materials, providing high-resolution structural and
spectroscopic information. However, reliably retrieving chemical states from
the resulting data remains a major obstacle to accelerating the
fundamental understanding of materials in many research fields. In this work,
we propose a novel data formulation model for X-ray microspectroscopy and
develop a dedicated unmixing framework to solve this problem, which is robust
to noise and spectral variability. Moreover, this framework is not limited to
the analysis of two-state material chemistry, making it an effective
alternative to conventional and widely-used methods. In addition, an
alternating direction method of multipliers (ADMM) with provable convergence
is applied to obtain the solution efficiently. Our framework can accurately identify and
characterize chemical states in complex and heterogeneous samples, even under
challenging conditions such as low signal-to-noise ratios and overlapping
spectral features. Extensive experimental results on simulated and real
datasets demonstrate its effectiveness and reliability.
Authors' comments: 12 pages
Shashank Gupta
Question answering is a task that answers factoid questions using a large
collection of documents. It aims to provide precise answers in response to the
user's questions in natural language. Question answering relies on efficient
passage retrieval to select candidate contexts, where traditional sparse vector
space models, such as TF-IDF or BM25, are the de facto method. On the web,
no single article provides every possible answer to a user's question. The
existing Dense Passage Retrieval (DPR) model was trained on the Wikipedia dump
from Dec. 20, 2018, as the source documents for answering questions. Question
answering (QA) has made big strides with several open-domain and machine
comprehension systems built using large-scale annotated datasets. However, in
the clinical domain, this problem remains relatively unexplored. According to
multiple surveys, biomedical questions cannot be answered correctly from
Wikipedia articles. In this work, we adapt the existing DPR framework to the
biomedical domain and retrieve answers from PubMed articles, which are a
reliable source for answering medical questions. When evaluated on a BioASQ QA
dataset, our fine-tuned dense retriever achieves an F1 score of 0.81.
Authors' comments: 6 pages, 5 figures. arXiv admin note: text overlap with
arXiv:2004.04906 by other authors
Victor A. P. Magri, Peter Lindstrom
In scientific simulations, observations, and experiments, the cost of
transferring data to and from disk and across networks has become a significant
bottleneck that particularly impacts subsequent data analysis and
visualization. To address this challenge, compression techniques have been
widely adopted. However, traditional lossy compression approaches often require
setting error tolerances conservatively to respect the numerical sensitivities
of a wide variety of post hoc data analyses, some of which may not even be
known a priori. Progressive data compression and retrieval has emerged as a
solution, allowing for the adaptive handling of compressed data according to
the needs of a given post-processing task. However, few analysis algorithms
natively support progressive data processing, and adapting compression
techniques, file formats, client/server frameworks, and APIs to support
progressivity can be challenging. This work presents a general framework that
supports progressive-precision data queries independently of the underlying
data compressor or number representation. Our approach is based on a
multiple-component representation that successively, with each new component,
reduces the error between the original and compressed field, allowing each
field in the progressive sequence to be expressed as a partial sum of
components. We have implemented our approach on top of four popular scientific
data compressors and have evaluated its behavior on several real-world data
sets from the SDRBench collection. Numerical results indicate that our
framework is effective in terms of accuracy compared to each of the standalone
compressors it builds upon. In addition, (de)compression time is proportional
to the number and granularity of components. Finally, our framework allows for
fully lossless compression using lossy compressors when a sufficient number of
components are employed.
Authors' comments: To be published in Proceedings of IEEE VIS 2023, IEEE Transactions on
Visualization and Computer Graphics
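The multiple-component idea in the abstract above, where each new component compresses the error left by the previous ones and any prefix of components sums to a usable approximation, can be sketched as follows. This is an illustration under stated assumptions, not the authors' framework: `quantize` is a hypothetical stand-in for a real lossy compressor, and the field values and step sizes are made up.

```python
def quantize(values, step):
    """Stand-in lossy compressor (assumption): round to multiples of step."""
    return [round(v / step) * step for v in values]


def progressive_components(field, steps):
    """Build components by repeatedly compressing the remaining residual."""
    components = []
    residual = list(field)
    for step in steps:
        component = quantize(residual, step)
        components.append(component)
        # What the components so far still fail to represent.
        residual = [r - c for r, c in zip(residual, component)]
    return components


def reconstruct(components, k):
    """Approximate the field as the partial sum of the first k components."""
    n = len(components[0])
    return [sum(c[i] for c in components[:k]) for i in range(n)]


def max_error(a, b):
    """Largest pointwise absolute difference between two fields."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical 1-D field and progressively finer step sizes.
field = [0.12, 3.57, -1.94, 2.718, 0.5]
components = progressive_components(field, steps=[1.0, 0.1, 0.01])
errors = [max_error(field, reconstruct(components, k)) for k in (1, 2, 3)]
```

With this rounding quantizer, the error after k components is bounded by half of the k-th step size, so a reader can stop fetching components as soon as the bound meets the tolerance of the analysis at hand.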
Youyang Ng, Daisuke Miyashita, Yasuto Hoshi, Yasuhiro Morioka, Osamu Torii, Tomoya Kodama, Jun Deguchi
Large Language Model (LLM) based Generative AI systems have seen significant
progress in recent years. Integrating a knowledge retrieval architecture allows
for seamless integration of private data into publicly available Generative AI
systems using pre-trained LLM without requiring additional model fine-tuning.
Moreover, the Retrieval-Centric Generation (RCG) approach, a promising future
research direction that explicitly separates roles of LLMs and retrievers in
context interpretation and knowledge memorization, potentially leads to more
efficient implementation. SimplyRetrieve is an open-source tool with the goal
of providing a localized, lightweight, and user-friendly interface to these
sophisticated advancements to the machine learning community. SimplyRetrieve
features a GUI and API based RCG platform, assisted by a Private Knowledge Base
Constructor and a Retrieval Tuning Module. By leveraging these capabilities,
users can explore the potential of RCG for improving generative AI performance
while maintaining privacy standards. The tool is available at
https://github.com/RCGAI/SimplyRetrieve with an MIT license.
Authors' comments: 12 pages, 6 figures
Roman Dušek, Aleksander Wawer, Christopher Galias, Lidia Wojciechowska
The aim of this article is to investigate the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking. We demonstrate this for both English and Polish languages, using data from one of the largest Polish e-commerce sites and selected open-domain datasets. We employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data. Our results point to the fact that NLI fine-tuning increases the performance of the models in both tasks and both languages, with the potential to improve mono- and multilingual models. Finally, we investigate uniformity and alignment of the embeddings to explain the effect of NLI-based fine-tuning for an out-of-domain use-case.
Haoxiang Shi, Sumio Fujita, Tetsuya Sakai
Domain transfer is a prevalent challenge in modern neural Information Retrieval (IR). To overcome this problem, previous research has utilized domain-specific manual annotations and synthetic data produced by consistency filtering to finetune a general ranker and produce a domain-specific ranker. However, training such consistency filters is computationally expensive, which significantly reduces the model efficiency. In addition, consistency filtering often struggles to identify retrieval intentions and recognize query and corpus distributions in a target domain. In this study, we evaluate a more efficient solution: replacing the consistency filter with either direct pseudo-labeling, pseudo-relevance feedback, or unsupervised keyword generation methods to achieve consistency-filtering-free unsupervised dense retrieval. Our extensive experimental evaluations demonstrate that, on average, TextRank-based pseudo-relevance feedback outperforms other methods. Furthermore, we analyze the training and inference efficiency of the proposed paradigm. The results indicate that filtering-free unsupervised learning can continuously improve training and inference efficiency while maintaining retrieval performance. In some cases, it can even improve performance on particular datasets.
Zhengyang Mao, Wei Ju, Yifang Qin, Xiao Luo, Ming Zhang
Graph classification is a crucial task in many real-world multimedia
applications, where graphs can represent various multimedia data types such as
images, videos, and social networks. Previous efforts have applied graph neural
networks (GNNs) in settings where the class distribution is
balanced. However, real-world data typically exhibit long-tailed class
distributions, resulting in a bias towards the head classes when using GNNs and
limited generalization ability over the tail classes. Recent approaches mainly
focus on re-balancing different classes during model training, which fails to
explicitly introduce new knowledge and sacrifices the performance of the head
classes. To address these drawbacks, we propose a novel framework called
Retrieval Augmented Hybrid Network (RAHNet) to jointly learn a robust feature
extractor and an unbiased classifier in a decoupled manner. In the feature
extractor training stage, we develop a graph retrieval module to search for
relevant graphs that directly enrich the intra-class diversity for the tail
classes. Moreover, we innovatively optimize a category-centered supervised
contrastive loss to obtain discriminative representations, which is more
suitable for long-tailed scenarios. In the classifier fine-tuning stage, we
balance the classifier weights with two weight regularization techniques, i.e.,
Max-norm and weight decay. Experiments on various popular benchmarks verify the
superiority of the proposed method against state-of-the-art approaches.
Authors' comments: Accepted by the ACM International Conference on Multimedia (MM) 2023
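One common reading of the Max-norm regularization mentioned in the abstract above is a projection that caps the L2 norm of each per-class weight vector, so that head classes cannot dominate through larger weights. The sketch below shows that reading only; it is an assumption about the general technique, not RAHNet's implementation, and the weights and threshold are hypothetical.

```python
import math


def max_norm_project(weights, max_norm):
    """Rescale each per-class weight vector so its L2 norm is at most max_norm.

    Vectors already inside the norm ball are left unchanged.
    """
    projected = []
    for row in weights:
        norm = math.sqrt(sum(w * w for w in row))
        scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
        projected.append([w * scale for w in row])
    return projected


# Hypothetical classifier weights: a large-norm "head" row and a small "tail" row.
weights = [[3.0, 4.0], [0.3, 0.4]]
projected = max_norm_project(weights, max_norm=1.0)
```

In a training loop, such a projection would typically be applied after each optimizer step, while weight decay is handled by the optimizer itself.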
Saipraneeth Devunuri, Shirin Qiam, Lewis Lehe
The General Transit Feed Specification (GTFS) standard for publishing transit
data is ubiquitous. Because GTFS is tabular data with information spread across
different files, retrieving information from it requires specialized tools or
packages. Concurrently, the use of Large Language Models (LLMs) for text and
information retrieval is growing. This research examines whether current,
widely adopted LLMs (ChatGPT) can understand GTFS and retrieve information
from it using natural language instructions without explicitly
providing information. We benchmark OpenAI's GPT-3.5-Turbo
and GPT-4 LLMs, which are the backbone of ChatGPT. ChatGPT demonstrates a
reasonable understanding of GTFS by answering 59.7% (GPT-3.5-Turbo) and 73.3%
(GPT-4) of our multiple-choice questions (MCQ) correctly. Furthermore, we
evaluated the LLMs on information extraction tasks using a filtered GTFS feed
containing four routes. We found that program synthesis techniques outperformed
zero-shot approaches, achieving up to 93% (90%) accuracy for simple queries and
61% (41%) for complex ones using GPT-4 (GPT-3.5-Turbo).
Authors' comments: 22 pages, 8 figures, 1 table, Public Transport
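To give a sense of what a simple GTFS query involves, the sketch below answers "how many trips does each route have?" by joining two GTFS tables on `route_id`. It illustrates the kind of short program a program-synthesis approach might produce; it is not taken from the paper. The file contents are a hypothetical two-route feed held in strings, while the column names (`route_id`, `route_short_name`, `service_id`, `trip_id`) follow the GTFS specification for routes.txt and trips.txt.

```python
import csv
import io
from collections import Counter

# Hypothetical miniature feed; a real feed would be read from routes.txt and trips.txt.
ROUTES_TXT = """route_id,route_short_name,route_type
R1,Green,3
R2,Yellow,3
"""

TRIPS_TXT = """route_id,service_id,trip_id
R1,WKDY,T1
R1,WKDY,T2
R2,WKDY,T3
"""


def read_table(text):
    """Parse one GTFS CSV table into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def trips_per_route(routes_text, trips_text):
    """Count trips per route short name by joining the tables on route_id."""
    routes = read_table(routes_text)
    trips = read_table(trips_text)
    counts = Counter(trip["route_id"] for trip in trips)
    return {r["route_short_name"]: counts.get(r["route_id"], 0) for r in routes}


result = trips_per_route(ROUTES_TXT, TRIPS_TXT)
```

Even this simple question needs a join across files, which is why answering it from raw text alone is harder than generating and running a small program.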
Kaibin Tian, Ruixiang Zhao, Hu Hu, Runquan Xie, Fengzong Lian, Zhanhui Kang, Xirong Li
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods are dominating. Compared to CLIP4Clip, which is efficient and compact, the state-of-the-art models tend to compute video-text similarity by fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR into doubt. For efficient T2VR, we propose TeachCLIP with multi-grained teaching to let a CLIP4Clip based student network learn from more advanced yet computationally heavy models such as X-CLIP, TS2-Net and X-Pool. To improve the student's learning capability, we add an Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage/computation overhead at the retrieval stage. While attentive weights produced by AFA are commonly used for combining frame-level features, we propose a novel use of the weights to let them imitate frame-text relevance estimated by the teacher network. As such, AFA provides a fine-grained learning (teaching) channel for the student (teacher). Extensive experiments on multiple public datasets justify the viability of the proposed method.
Evert Nasedkin, Paul Mollière, Jason Wang, Faustine Cantalloube, Laura Kreidberg, Laurent Pueyo, Tomas Stolker, Arthur Vigan
Many post-processing algorithms have been developed in order to better
separate the signal of a companion from the bright light of the host star, but
the effect of such algorithms on the shape of exoplanet spectra extracted from
integral field spectrograph data is poorly understood. The resulting spectra
are affected by noise that is correlated in wavelength space due to both
optical and data processing effects. Within the framework of Bayesian
atmospheric retrievals, we aim to understand how these correlations and other
systematic effects impact the inferred physical parameters. We consider three
algorithms (KLIP, PynPoint and ANDROMEDA), optimizing the choice of algorithmic
parameters using a series of injection tests into archival SPHERE and GPI data
of the HR 8799 system. The wavelength-dependent covariance matrix is calculated
to provide a measure of instrumental and algorithmic systematics. We perform
atmospheric retrievals using petitRADTRANS on optimally extracted spectra to
measure how these data processing systematics influence the retrieved parameter
distributions. The choice of data processing algorithm and parameters
significantly impacts the accuracy of retrieval results, with the mean posterior
parameter bias ranging from 1 to 3 $\sigma$ from the true input parameters.
Including the full covariance matrix in the likelihood improves the accuracy of
inferred parameters; the correlations cannot be accounted for using ad hoc
scaling parameters in the retrieval framework. Using the Bayesian information
criterion and other statistical measures as heuristic goodness-of-fit metrics,
the retrievals including the full covariance matrix are favoured when compared
to those using only the diagonal elements.
Authors' comments: 22 pages, 13 figures, accepted to Astronomy & Astrophysics
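The difference between using the full covariance matrix and only its diagonal, as discussed in the abstract above, can be written out for a Gaussian likelihood. The notation is an assumption introduced here for illustration: $\mathbf{d}$ is the extracted spectrum, $\mathbf{m}$ the model spectrum, and $\mathbf{C}$ the wavelength-dependent covariance matrix. The paper's exact likelihood may include further terms.

```latex
\ln \mathcal{L}_{\mathrm{full}}
  = -\tfrac{1}{2}\,(\mathbf{d}-\mathbf{m})^{\mathsf{T}}\,\mathbf{C}^{-1}\,(\mathbf{d}-\mathbf{m})
    - \tfrac{1}{2}\,\ln\det\!\left(2\pi\mathbf{C}\right),
\qquad
\ln \mathcal{L}_{\mathrm{diag}}
  = -\tfrac{1}{2}\sum_{i}\left[\frac{(d_i-m_i)^2}{C_{ii}} + \ln\!\left(2\pi C_{ii}\right)\right].
```

The two expressions coincide when the off-diagonal elements of $\mathbf{C}$ vanish, so any difference between them comes entirely from the correlations between wavelength channels.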
Andreas Chari, Sean MacAvaney, Iadh Ounis
One advantage of neural ranking models is that they are meant to generalise
well in situations of synonymity, i.e., where two words have similar or identical
meanings. In this paper, we investigate and quantify how well various ranking
models perform in a clear-cut case of synonymity: when words are simply
expressed in different surface forms due to regional differences in spelling
conventions (e.g., color vs colour). We first explore the prevalence of
American and British English spelling conventions in datasets used for the
pre-training, training and evaluation of neural retrieval methods, and find
that American spelling conventions are far more prevalent. Despite these biases
in the training data, we find that retrieval models often generalise well in
this case of synonymity. We explore the effect of document spelling
normalisation in retrieval and observe that all models are affected by
normalising the document's spelling. While they all experience a drop in
performance when normalised to a different spelling convention than that of the
query, we observe varied behaviour when the document is normalised to share the
query spelling convention: lexical models show improvements, dense retrievers
remain unaffected, and re-rankers exhibit contradictory behaviour.
Authors' comments: 10 pages, 3 tables, short paper published in SIGIR '23
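The effect of document spelling normalisation on a lexical model, as described in the abstract above, can be illustrated with a toy term-overlap score. This is a sketch only: the three-entry mapping, the query and the document are hypothetical, and `term_overlap` is a crude stand-in for a lexical ranker such as BM25 rather than the models evaluated in the paper.

```python
# Hypothetical, deliberately tiny British-to-American spelling map.
UK_TO_US = {"colour": "color", "normalise": "normalize", "centre": "center"}


def normalise_spelling(text, mapping):
    """Rewrite each whitespace-separated token using the spelling map."""
    return " ".join(mapping.get(tok, tok) for tok in text.lower().split())


def term_overlap(query, document):
    """Number of distinct terms shared by query and document."""
    return len(set(query.lower().split()) & set(document.lower().split()))


query = "color of the city center"               # American spelling
document = "the colour of the city centre at night"  # British spelling

before = term_overlap(query, document)
after = term_overlap(query, normalise_spelling(document, UK_TO_US))
```

Exact-match scoring misses "colour" and "centre" until the document is rewritten in the query's convention, which is consistent with the improvement the abstract reports for lexical models.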
Philipp Grohs, Lukas Liehr, Martin Rathmair
We study the determination of functions in Fock space from samples of their
absolute value, known as the phase retrieval problem in Fock space. An
important finding in this research field asserts that phaseless sampling on
lattices of arbitrary density renders the problem unsolvable. The present study
establishes solvability when using irregular sampling sets of the form $A \cup
B \cup C$, where $A, B,$ and $C$ constitute perturbations of a Liouville set,
i.e., a set with the property that all functions in Fock space bounded on the
set are constant. The sets $A, B,$ and $C$ adhere to specific geometrical
conditions of closeness and noncollinearity. We show that these conditions are
sufficiently generic so as to allow the perturbations to be chosen also at
random. By proving that Liouville sets occupy an intermediate position between
sets of stable sampling and sets of uniqueness, we obtain the first
construction of uniqueness sets for the phase retrieval problem in Fock space
having a finite density. The established results apply to the Gabor phase
retrieval problem in subspaces of $L^2(\mathbb{R})$, where we derive additional
reductions of the size of uniqueness sets: for the class of real-valued
functions, uniqueness is achieved from two perturbed lattices; for the class of
even real-valued functions, a single perturbation suffices, resulting in a
separated set.
Authors' comments: 36 pages, 5 figures, incorporated referee suggestions
Bhoomeendra Singh Sisodiya, Narendra Babu Unnam, P. Krishna Reddy, Apala Das, K. V. K. Santhy, V. Balakista Reddy
Developing methods for extracting relevant legal information to aid legal
practitioners is an active research area. In this regard, research efforts are
being made by leveraging different kinds of information, such as meta-data,
citations, keywords, sentences, paragraphs, etc. Similar to any text document,
legal documents are composed of paragraphs. In this paper, we have analyzed the
resourcefulness of paragraph-level information in capturing similarity among
judgments for improving the performance of precedence retrieval. We found that
the paragraph-level methods could capture the similarity among the judgments
with only a few paragraph interactions and exhibit more discriminating power
over the baseline document-level method. Moreover, the comparison results on
two benchmark datasets for the precedence retrieval on the Indian Supreme Court
judgments task show that the paragraph-level methods exhibit comparable
performance with the state-of-the-art methods.
Authors' comments: 5 pages, 3 figures, ICAIL 2023
Wenjie Mei, Andrew M. Maiden
Set projection algorithms are a class of algorithms used in ptychography to
help improve the quality of the reconstructed images. The set projection step
is important because it helps to ensure that the reconstructed image satisfies
the physical constraints, which can improve the quality of the final result. A
new projection algorithm that combines the advantages of the existing
algorithms is proposed, offering the possibility of a parallel implementation
of iterative algorithms.
Authors' comments: Presented in ISCS23
Micheal Abaho, Yousef H. Alfaifi
Injecting textual information into knowledge graph (KG) entity representations has been a worthwhile expedition in terms of improving performance in KG oriented tasks within the NLP community. External knowledge often adopted to enhance KG embeddings ranges from semantically rich lexical dependency parsed features to a set of relevant key words to entire text descriptions supplied from an external corpus such as Wikipedia and many more. Despite the gains this innovation (Text-enhanced KG embeddings) has made, the proposal in this work suggests that it can be improved even further. Instead of using a single text description (which would not sufficiently represent an entity because of the inherent lexical ambiguity of text), we propose a multi-task framework that jointly selects a set of text descriptions relevant to KG entities and aligns or augments KG embeddings with text descriptions. Different from prior work that plugs in formal entity descriptions declared in knowledge bases, this framework leverages a retriever model to selectively identify richer or highly relevant text descriptions to use in augmenting entities. Furthermore, the framework treats the number of descriptions to use in the augmentation process as a parameter, which allows the flexibility of enumerating across several numbers before identifying an appropriate number. Experimental results for Link Prediction demonstrate increases of 5.5% and 3.5% in the Mean Reciprocal Rank (MRR) and Hits@10 scores respectively, in comparison to text-enhanced knowledge graph augmentation methods using traditional CNNs.
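For readers unfamiliar with the two link-prediction metrics reported in the abstract above, the following minimal sketch shows their standard definitions. The list of ranks is hypothetical: each entry is the position at which the correct entity appeared in the ranked candidate list for one test triple (1 is best).

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test queries whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct entity for four test triples.
ranks = [1, 2, 4, 20]
mrr = mean_reciprocal_rank(ranks)
hits_10 = hits_at_k(ranks, 10)
```

MRR rewards placing the correct entity near the very top, while Hits@10 only checks whether it lands anywhere in the first ten positions.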
Yan Gong, Georgina Cosma
Visual-Semantic Embedding (VSE) networks can help search engines better understand the meaning behind visual content and associate it with relevant textual information, leading to more accurate search results. VSE networks can be used in cross-modal search engines to embed image and textual descriptions in a shared space, enabling image-to-text and text-to-image retrieval tasks. However, the potential of VSE networks for search engines has yet to be fully explored. This paper presents Boon, a novel cross-modal search engine that combines two state-of-the-art networks: the GPT-3.5-turbo large language model, and the VSE network VITR (VIsion Transformers with Relation-focused learning) to enhance the engine's capabilities in extracting and reasoning with regional relationships in images. VITR employs encoders from CLIP that were trained with 400 million image-description pairs and it was fine-tuned on the RefCOCOg dataset. Boon's neural-based components serve as its main functionalities: 1) a 'cross-modal search engine' that enables end-users to perform image-to-text and text-to-image retrieval. 2) a 'multi-lingual conversational AI' component that enables the end-user to converse about one or more images selected by the end-user. Such a feature makes the search engine accessible to a wide audience, including those with visual impairments. 3) Boon is multi-lingual and can take queries and handle conversations about images in multiple languages. Boon was implemented using the Django and PyTorch frameworks. The interface and capabilities of the Boon search engine are demonstrated using the RefCOCOg dataset, and the engine's ability to search for multimedia through the web is facilitated by Google's API.
Yan Gong, Georgina Cosma, Axel Finke
Creating an intelligent search and retrieval system for artwork images, particularly paintings, is crucial for documenting cultural heritage, fostering wider public engagement, and advancing artistic analysis and interpretation. Visual-Semantic Embedding (VSE) networks are deep learning models used for information retrieval, which learn joint representations of textual and visual data, enabling 1) cross-modal search and retrieval tasks, such as image-to-text and text-to-image retrieval; and 2) relation-focused retrieval to capture entity relationships and provide more contextually relevant search results. Although VSE networks have played a significant role in cross-modal information retrieval, their application to painting datasets, such as ArtUK, remains unexplored. This paper introduces BoonArt, a VSE-based cross-modal search engine that allows users to search for images using textual queries, and to obtain textual descriptions along with the corresponding images when using image queries. The performance of BoonArt was evaluated using the ArtUK dataset. Experimental evaluations revealed that BoonArt achieved 97% Recall@10 for image-to-text retrieval, and 97.4% Recall@10 for text-to-image retrieval. By bridging the gap between textual and visual modalities, BoonArt provides a much-improved search performance compared to traditional search engines, such as the one provided by the ArtUK website. BoonArt can be utilised to work with other artwork datasets.
Sumit Sharma, Sarika Jain
Situation awareness is a crucial cognitive skill that enables individuals to perceive, comprehend, and project the current state of their environment accurately. It involves being conscious of relevant information, understanding its meaning, and using that understanding to make well-informed decisions. Awareness systems often need to integrate new knowledge and adapt to changing environments. Ontology reasoning facilitates knowledge integration and evolution, allowing for seamless updates and expansions of the ontology. With the above in mind, we provide a brief review of semantic information retrieval and ontology engineering to understand the emerging challenges and future research directions. In the review, we found that ontology reasoning addresses the limitations of traditional systems by providing a formal, flexible, and scalable framework for knowledge representation, reasoning, and inference.