Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger et al.
Large language models (LLMs) have shown promise in proving formal theorems
using proof assistants such as Lean. However, existing methods are difficult to
reproduce or build on, due to private code, data, and large compute
requirements. This has created substantial barriers to research on machine
learning methods for theorem proving. This paper removes these barriers by
introducing LeanDojo: an open-source Lean playground consisting of toolkits,
data, models, and benchmarks. LeanDojo extracts data from Lean and enables
interaction with the proof environment programmatically. It contains
fine-grained annotations of premises in proofs, providing valuable data for
premise selection: a key bottleneck in theorem proving. Using this data, we
develop ReProver (Retrieval-Augmented Prover): an LLM-based prover augmented
with retrieval for selecting premises from a vast math library. It is
inexpensive and needs only one GPU week of training. Our retriever leverages
LeanDojo's program analysis capability to identify accessible premises and hard
negative examples, which makes retrieval much more effective. Furthermore, we
construct a new benchmark consisting of 98,734 theorems and proofs extracted
from Lean's math library. It features a challenging data split requiring the
prover to generalize to theorems relying on novel premises that are never used
in training. We use this benchmark for training and evaluation, and
experimental results demonstrate the effectiveness of ReProver over
non-retrieval baselines and GPT-4. We thus provide the first set of open-source
LLM-based theorem provers without any proprietary datasets and release it under
a permissive MIT license to facilitate further research.
Authors' comments: Accepted to NeurIPS 2023 (Datasets and Benchmarks Track) as an oral
presentation. Data, code, and models available at https://leandojo.org/
Zhong Ji, Zhihao Li, Yan Zhang, Haoran Wang, Yanwei Pang, Xuelong Li
As a promising field, Multi-Query Image Retrieval (MQIR) aims at searching for the semantically relevant image given multiple region-specific text queries. Existing works mainly focus on a single-level similarity between image regions and text queries, which neglects the hierarchical guidance of multi-level similarities and results in incomplete alignments. Besides, the high-level semantic correlations that intrinsically connect different region-query pairs are rarely considered. To address the above limitations, we propose a novel Hierarchical Matching and Reasoning Network (HMRN) for MQIR. It disentangles MQIR into three hierarchical semantic representations, which are responsible for capturing fine-grained local details, contextual global scopes, and high-level inherent correlations. HMRN comprises two modules: a Scalar-based Matching (SM) module and a Vector-based Reasoning (VR) module. Specifically, the SM module characterizes the multi-level alignment similarity, which consists of a fine-grained local-level similarity and a context-aware global-level similarity. Afterwards, the VR module is developed to excavate the potential semantic correlations among multiple region-query pairs, which further explores the high-level reasoning similarity. Finally, these three-level similarities are aggregated into a joint similarity space to form the ultimate similarity. Extensive experiments on the benchmark dataset demonstrate that our HMRN substantially surpasses the current state-of-the-art methods. For instance, compared with the existing best method Drill-down, the metric R@1 in the last round is improved by 23.4%. Our source code will be released at https://github.com/LZH-053/HMRN.
Jose M Munoz, Ilyes Batatia, Christoph Ortner, Francesco Romeo
Deep learning approaches for jet tagging in high-energy physics are characterized as black boxes that process a large amount of information from which it is difficult to extract key distinctive observables. In this proceeding, we present an alternative to deep learning approaches, Boost Invariant Polynomials, which enables direct analysis of simple analytic expressions representing the most important features in a given task. Further, we show how this approach provides an extremely low dimensional classifier with a minimum set of features representing physically relevant observables and how it consequently speeds up the algorithm execution, with relatively close performance to the algorithm using the full information.
Marco Peer, Robert Sablatnig
This paper proposes a deep-learning-based approach to writer retrieval and
identification for papyri, with a focus on identifying fragments associated
with a specific writer and those corresponding to the same image. We present a
novel neural network architecture that combines a residual backbone with a
feature mixing stage to improve retrieval performance, and the final descriptor
is derived from a projection layer. The methodology is evaluated on two
benchmarks: PapyRow, where we achieve a mAP of 26.6 % and 24.9 % on writer and
page retrieval, and HisFragIR20, showing state-of-the-art performance (44.0 %
and 29.3 % mAP). Furthermore, our network has an accuracy of 28.7 % for writer
identification. Additionally, we conduct experiments on the influence of two
binarization techniques on fragments and show that binarizing does not enhance
performance. Our code and models are available to the community.
Authors' comments: accepted for HIP@ICDAR2023
Radu Balan, Efstratos Tsoukanis
This paper discusses the connection between the phase retrieval problem and
permutation invariant embeddings. We show that the real phase retrieval problem
for $\mathbb{R}^d/O(1)$ is equivalent to Euclidean embeddings of the quotient
space $\mathbb{R}^{2\times d}/S_2$ performed by the sorting encoder introduced
in an earlier work. In addition, this relationship provides us with inversion
algorithms of the orbits induced by the group of permutation matrices.
Authors' comments: Presented at the SampTA 2023 conference, July 2023, Yale University,
New Haven, CT
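A brief note on why sorting and absolute values are linked in the phase retrieval entry above. This is an elementary identity offered as an illustration, not the paper's construction. For a pair of real numbers $(u, v)$, defined only up to swapping under $S_2$, the sorted pair is determined by the sum $u+v$ and the magnitude $|u-v|$:

$$
\min(u,v) = \frac{(u+v) - |u-v|}{2}, \qquad \max(u,v) = \frac{(u+v) + |u-v|}{2}.
$$

The sum is linear, and the term $|u-v|$ is unchanged by a sign flip of $u-v$, which is the same $O(1) = \{\pm 1\}$ ambiguity found in real phase retrieval measurements of the form $|\langle x, a\rangle|$. The symbols $u$, $v$, $x$, and $a$ are illustrative and do not come from the abstract.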
Soumya Chatterjee, Omar Khattab, Simran Arora
We introduce and define the novel problem of multi-distribution information
retrieval (IR) where given a query, systems need to retrieve passages from
within multiple collections, each drawn from a different distribution. Some of
these collections and distributions might not be available at training time. To
evaluate methods for multi-distribution retrieval, we design three benchmarks
for this task from existing single-distribution datasets, namely, a dataset
based on question answering and two based on entity matching. We propose simple
methods for this task which allocate the fixed retrieval budget (top-k
passages) strategically across domains to prevent the known domains from
consuming most of the budget. We show that our methods lead to improvements of
3.8+ points on average and up to 8.0 points in Recall@100 across the datasets and
that improvements are consistent when fine-tuning different base retrieval
models. Our benchmarks are made publicly available.
Authors' comments: REML @ SIGIR 2023; 9 pages, 8 figures
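The budget allocation idea in the multi-distribution retrieval entry above can be sketched as follows. The abstract does not specify the allocation rule, so this minimal Python sketch assumes an even split of the top-k budget across collections, with any unused slots filled by the best remaining scores. The function name `allocate_budget` and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple

Scored = Tuple[str, float]  # (passage_id, retrieval_score)


def allocate_budget(ranked: Dict[str, List[Scored]], k: int) -> List[Scored]:
    """Split a top-k budget evenly across collections (an assumed rule),
    then fill any remaining slots with the best leftover passages."""
    names = sorted(ranked)
    if not names or k <= 0:
        return []
    per_collection = k // len(names)
    chosen: List[Scored] = []
    leftovers: List[Scored] = []
    for name in names:
        # Rank each collection on its own so no collection can crowd out another.
        hits = sorted(ranked[name], key=lambda h: h[1], reverse=True)
        chosen.extend(hits[:per_collection])
        leftovers.extend(hits[per_collection:])
    # Spend whatever budget is left on the highest-scoring leftovers.
    leftovers.sort(key=lambda h: h[1], reverse=True)
    chosen.extend(leftovers[: k - len(chosen)])
    return sorted(chosen, key=lambda h: h[1], reverse=True)


if __name__ == "__main__":
    ranked = {
        "known": [("k1", 0.9), ("k2", 0.8), ("k3", 0.7), ("k4", 0.6)],
        "unknown": [("u1", 0.5), ("u2", 0.4)],
    }
    print(allocate_budget(ranked, k=4))
```

With a plain global top-4, the "known" collection in this toy input would take every slot. The even split reserves two slots for the "unknown" collection.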
Hao Cheng, Shuo Wang, Wensheng Lu, Wei Zhang, Mingyang Zhou, Kezhong Lu, Hao Liao
Explainable recommendation is a technique that combines prediction and generation tasks to produce more persuasive results. Among these tasks, textual generation demands large amounts of data to achieve satisfactory accuracy. However, historical user reviews of items are often insufficient, making it challenging to ensure the precision of generated explanation text. To address this issue, we propose a novel model, ERRA (Explainable Recommendation by personalized Review retrieval and Aspect learning). With retrieval enhancement, ERRA can obtain additional information from the training sets. With this additional information, we can generate more accurate and informative explanations. Furthermore, to better capture users' preferences, we incorporate an aspect enhancement component into our model. By selecting the top-n aspects that users are most concerned about for different items, we can model user representation with more relevant details, making the explanation more persuasive. To verify the effectiveness of our model, extensive experiments on three datasets show that our model outperforms state-of-the-art baselines (for example, 3.4% improvement in prediction and 15.8% improvement in explanation for TripAdvisor).
Julia Henkel, Genc Hoxha, Gencer Sumbul, Lars Möllenbrok, Begüm Demir
Deep metric learning (DML) based methods have been found very effective for
content-based image retrieval (CBIR) in remote sensing (RS). For accurately
learning the model parameters of deep neural networks, most of the DML methods
require a high number of annotated training images, which can be costly to
gather. To address this problem, in this paper we present an annotation cost
efficient active learning (AL) method (denoted as ANNEAL). The proposed method
aims to iteratively enrich the training set by annotating the most informative
image pairs as similar or dissimilar, while accurately modelling a deep metric
space. This is achieved by two consecutive steps. In the first step the
pairwise image similarity is modelled based on the available training set.
Then, in the second step the most uncertain and diverse (i.e., informative)
image pairs are selected to be annotated. Unlike the existing AL methods for
CBIR, at each AL iteration of ANNEAL a human expert is asked to annotate the
most informative image pairs as similar/dissimilar. This significantly reduces
the annotation cost compared to annotating images with land-use/land cover
class labels. Experimental results show the effectiveness of our method. The
code of ANNEAL is publicly available at https://git.tu-berlin.de/rsim/ANNEAL.
Authors' comments: Accepted at IEEE International Geoscience and Remote Sensing
Symposium (IGARSS) 2023. Our code is available at
https://git.tu-berlin.de/rsim/ANNEAL
Sanne Bloot, Yamila Miguel, Michaël Bazot, Saburo Howard
The mass and distribution of metals in the interiors of exoplanets are
essential for constraining their formation and evolution processes.
Nevertheless, with only masses and radii measured, the determination of
exoplanet interior structures is degenerate, and so far simplified assumptions
have mostly been used to derive planetary metallicities. In this work, we
present a method based on a state-of-the-art interior code, recently used for
Jupiter, and a Bayesian framework, to explore the possibility of retrieving the
interior structure of exoplanets. We use masses, radii, equilibrium
temperatures, and measured atmospheric metallicities to retrieve planetary bulk
metallicities and core masses. Following results on the giant planets in the
solar system and recent developments in planet formation, we implement two
interior structure models: one with a homogeneous envelope and one with an
inhomogeneous one. Our method is first evaluated using a test planet and then
applied to a sample of 37 giant exoplanets with observed atmospheric
metallicities from the pre-JWST era. Although neither internal structure model
is preferred with the current data, it is possible to obtain information on the
interior properties of the planets, such as the core mass, through atmospheric
measurements in both cases. We present updated metal mass fractions, in
agreement with recent results on giant planets in the solar system.
Authors' comments: Accepted for publication in MNRAS
Genta Indra Winata, Liang-Kang Huang, Soumya Vadlamannati, Yash Chandarana
Transformer-based language models have achieved remarkable success in
few-shot in-context learning and drawn a lot of research interest. However,
these models' performance greatly depends on the choice of the example prompts
and also has high variability depending on how samples are chosen. In this
paper, we conduct a comprehensive study of retrieving semantically similar
few-shot samples and using them as the context, as it helps the model decide
the correct label without any gradient update in the multilingual and
cross-lingual settings. We evaluate the proposed method on five natural
language understanding datasets related to intent detection, question
classification, sentiment analysis, and topic classification. The proposed
method consistently outperforms random sampling in monolingual and
cross-lingual tasks in non-English languages.
Authors' comments: 9 pages
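The retrieval step described in the few-shot in-context learning entry above can be illustrated with a small sketch. It assumes that sentence embeddings are already available as plain lists of floats and that similarity is cosine similarity. The abstract names neither the encoder nor the prompt format, so `retrieve_few_shot`, `build_prompt`, and the "Text/Label" template are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[str, str, Sequence[float]]  # (text, label, embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_few_shot(query_vec: Sequence[float], pool: List[Example], k: int) -> List[Example]:
    """Return the k pool examples most similar to the query embedding."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex[2]), reverse=True)
    return ranked[:k]


def build_prompt(query_text: str, shots: List[Example]) -> str:
    """Format retrieved examples as in-context demonstrations (assumed template)."""
    blocks = [f"Text: {text}\nLabel: {label}" for text, label, _ in shots]
    blocks.append(f"Text: {query_text}\nLabel:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    pool = [
        ("book a flight", "travel", [1.0, 0.0]),
        ("play jazz", "music", [0.0, 1.0]),
        ("reserve a plane seat", "travel", [0.9, 0.1]),
    ]
    shots = retrieve_few_shot([1.0, 0.05], pool, k=2)
    print(build_prompt("find me a ticket to Oslo", shots))
```

No gradient update is involved: the language model only sees the retrieved examples as context.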
Yonggang Jin, Chenxu Wang, Tianyu Zheng, Liuyu Xiang, Yaodong Yang, Junge Zhang, Jie Fu, Zhaofeng He
Deep reinforcement learning algorithms are usually impeded by sampling inefficiency, heavily depending on multiple interactions with the environment to acquire accurate decision-making capabilities. In contrast, humans rely on their hippocampus to retrieve relevant information from past experiences of relevant tasks, which guides their decision-making when learning a new task, rather than exclusively depending on environmental interactions. Nevertheless, designing a hippocampus-like module for an agent to incorporate past experiences into established reinforcement learning algorithms presents two challenges. The first challenge involves selecting the most relevant past experiences for the current task, and the second challenge is integrating such experiences into the decision network. To address these challenges, we propose a novel method that utilizes a retrieval network based on task-conditioned hypernetwork, which adapts the retrieval network's parameters depending on the task. At the same time, a dynamic modification mechanism enhances the collaborative efforts between the retrieval and decision networks. We evaluate the proposed method across various tasks within a multitask scenario in the Minigrid environment. The experimental results demonstrate that our proposed method significantly outperforms strong baselines.
Xin Cong, Bowen Yu, Mengcheng Fang, Tingwen Liu, Haiyang Yu, Zhongkai Hu, Fei Huang, Yongbin Li, Bin Wang
Universal Information Extraction (Universal IE) aims to solve different
extraction tasks in a uniform text-to-structure generation manner. Such a
generation procedure tends to struggle when there exist complex information
structures to be extracted. Retrieving knowledge from external knowledge bases
may help models to overcome this problem but it is impossible to construct a
knowledge base suitable for various IE tasks. Inspired by the fact that a large
amount of knowledge is stored in pretrained language models (PLMs) and can
be retrieved explicitly, in this paper, we propose MetaRetriever to retrieve
task-specific knowledge from PLMs to enhance universal IE. As different IE
tasks need different knowledge, we further propose a Meta-Pretraining Algorithm
which allows MetaRetriever to quickly achieve maximum task-specific retrieval
performance when fine-tuning on downstream IE tasks. Experimental results show
that MetaRetriever achieves the new state-of-the-art on 4 IE tasks, 12 datasets
under fully-supervised, low-resource and few-shot scenarios.
Authors' comments: Accepted to ACL 2023
Huang Xie, Khazar Khorrami, Okko Räsänen, Tuomas Virtanen
This paper explores grading text-based audio retrieval relevances with
crowdsourcing assessments. Given a free-form text (e.g., a caption) as a query,
crowdworkers are asked to grade audio clips using numeric scores (between 0 and
100) to indicate their judgements of how much the sound content of an audio
clip matches the text, where 0 indicates no content match at all and 100
indicates perfect content match. We integrate the crowdsourced relevances into
training and evaluating text-based audio retrieval systems, and evaluate the
effect of using them together with binary relevances from audio captioning.
Conventionally, these binary relevances are defined by captioning-based
audio-caption pairs, where being positive indicates that the caption describes
the paired audio, and being negative applies to all other pairs. Experimental
results indicate that there is no clear benefit from incorporating crowdsourced
relevances alongside binary relevances when the crowdsourced relevances are
binarized for contrastive learning. Conversely, the results suggest that using
only binary relevances defined by captioning-based audio-caption pairs is
sufficient for contrastive learning.
Authors' comments: Accepted at DCASE 2023 Workshop
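To make the binarization step in the audio retrieval entry above concrete, here is a minimal sketch of turning 0 to 100 crowdsourced grades into binary labels and combining them with captioning-based positives. The threshold of 50 and the function names are assumptions for illustration. The abstract does not state how the scores were binarized or merged.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (audio_id, caption_id)


def binarize_relevances(scores: Dict[Pair, float], threshold: float = 50.0) -> Dict[Pair, int]:
    """Map graded scores in [0, 100] to 1 (relevant) or 0 (not relevant).
    The threshold is an assumed value, not one taken from the paper."""
    return {pair: int(score >= threshold) for pair, score in scores.items()}


def merge_with_caption_pairs(crowd: Dict[Pair, int], caption_pairs: Set[Pair]) -> Dict[Pair, int]:
    """Combine binarized crowd labels with captioning-based positives.
    Pairs that come from audio captioning are always treated as positive."""
    merged = dict(crowd)
    for pair in caption_pairs:
        merged[pair] = 1
    return merged


if __name__ == "__main__":
    graded = {("a1", "c1"): 85.0, ("a1", "c2"): 20.0, ("a2", "c1"): 50.0}
    crowd = binarize_relevances(graded)
    print(merge_with_caption_pairs(crowd, {("a2", "c2"), ("a1", "c2")}))
```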
Yuqi Zhang, Qi Qian, Hongsong Wang, Chong Liu, Weihua Chen, Fan Wang
Visual retrieval tasks such as image retrieval and person re-identification
(Re-ID) aim at effectively and thoroughly searching images with similar content
or the same identity. After obtaining retrieved examples, re-ranking is a
widely adopted post-processing step to reorder and improve the initial
retrieval results by making use of the contextual information from semantically
neighboring samples. Prevailing re-ranking approaches update distance metrics
and mostly rely on inefficient crosscheck set comparison operations while
computing expanded neighbors based distances. In this work, we present an
efficient re-ranking method which refines initial retrieval results by updating
features. Specifically, we reformulate re-ranking based on Graph Convolution
Networks (GCN) and propose a novel Graph Convolution based Re-ranking (GCR) for
visual retrieval tasks via feature propagation. To accelerate computation for
large-scale retrieval, a decentralized and synchronous feature propagation
algorithm which supports parallel or distributed computing is introduced. In
particular, the plain GCR is extended for cross-camera retrieval and an
improved feature propagation formulation is presented to leverage affinity
relationships across different cameras. It is also extended for video-based
retrieval, and Graph Convolution based Re-ranking for Video (GCRV) is proposed
by mathematically deriving a novel profile vector generation method for the
tracklet. Without bells and whistles, the proposed approaches achieve
state-of-the-art performances on seven benchmark datasets from three different
tasks, i.e., image retrieval, person Re-ID and video-based person Re-ID.
Authors' comments: Code is publicly available:
https://github.com/WesleyZhang1991/GCN_rerank
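The idea of re-ranking by updating features, as described in the GCR entry above, can be illustrated with a generic feature propagation step over a k-nearest-neighbor graph. This is a simplified sketch under stated assumptions, not the paper's formulation. The cosine affinity, the mixing weight `alpha`, and the function names are all hypothetical. The update is synchronous, meaning every new feature is computed from the old features only.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def propagate(features: List[Vector], k: int, alpha: float = 0.5) -> List[Vector]:
    """One synchronous propagation step: each feature is mixed with its
    k most similar neighbors, weighted by non-negative cosine affinity."""
    feats = [normalize(f) for f in features]
    updated: List[Vector] = []
    for i, f in enumerate(feats):
        sims = [(dot(f, g), j) for j, g in enumerate(feats) if j != i]
        neighbors = sorted(sims, reverse=True)[:k]
        mixed = list(f)
        for weight, j in neighbors:
            weight = max(weight, 0.0)  # ignore negatively correlated neighbors
            for d in range(len(mixed)):
                mixed[d] += alpha * weight * feats[j][d]
        updated.append(normalize(mixed))
    return updated


if __name__ == "__main__":
    gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    refined = propagate(gallery, k=1)
    # Retrieval then ranks by similarity between the refined features.
    print(dot(refined[0], refined[1]))
```

Because each row is updated independently from the previous features, the loop over `i` could be run in parallel, which is the property that makes this kind of update attractive for large galleries.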
Jiongnan Liu, Jiajie Jin, Zihan Wang, Jiehan Cheng, Zhicheng Dou, Ji-Rong Wen
Although Large Language Models (LLMs) have demonstrated extraordinary
capabilities in many domains, they still have a tendency to hallucinate and
generate fictitious responses to user requests. This problem can be alleviated
by augmenting LLMs with information retrieval (IR) systems (also known as
retrieval-augmented LLMs). Applying this strategy, LLMs can generate more
factual texts in response to user input according to the relevant content
retrieved by IR systems from external corpora as references. In addition, by
incorporating external knowledge, retrieval-augmented LLMs can answer in-domain
questions that cannot be answered by solely relying on the world knowledge
stored in parameters. To support research in this area and facilitate the
development of retrieval-augmented LLM systems, we develop RETA-LLM, a
RETrieval-Augmented LLM toolkit. In RETA-LLM, we create a complete pipeline
to help researchers and users build their customized in-domain LLM-based
systems. Compared with previous retrieval-augmented LLM systems, RETA-LLM
provides more plug-and-play modules to support better interaction between IR
systems and LLMs, including request rewriting, document retrieval, passage
extraction, answer generation, and fact checking modules. Our toolkit is
publicly available at https://github.com/RUC-GSAI/YuLan-IR/tree/main/RETA-LLM.
Authors' comments: Technical Report for RETA-LLM
Tan-Sang Ha, Hai Nguyen-Truong, Tuan-Anh Vu, Sai-Kit Yeung
Building a video retrieval system that is robust and reliable, especially for
the marine environment, is a challenging task due to several factors such as
dealing with massive amounts of dense and repetitive data, occlusion,
blurriness, low lighting conditions, and abstract queries. To address these
challenges, we present MarineVRS, a novel and flexible video retrieval system
designed explicitly for the marine domain. MarineVRS integrates
state-of-the-art methods for visual and linguistic object representation to
enable efficient and accurate search and analysis of vast volumes of underwater
video data. In addition, unlike the conventional video retrieval system, which
only permits users to index a collection of images or videos and search using a
free-form natural language sentence, our retrieval system includes an
additional Explainability module that outputs the segmentation masks of the
objects that the input query referred to. This feature allows users to identify
and isolate specific objects in the video footage, leading to more detailed
analysis and understanding of their behavior and movements. Finally, with its
adaptability, explainability, accuracy, and scalability, MarineVRS is a
powerful tool for marine researchers and scientists to efficiently and
accurately process vast amounts of data and gain deeper insights into the
behavior and movements of marine species.
Authors' comments: Accepted to OCEANS 2023 Limerick. Website:
https://marinevrs.hkustvgd.com/
Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
Text-video retrieval contains various challenges, including biases coming
from diverse sources. We highlight some of them supported by illustrations to
open a discussion. Besides, we address one of the biases, frame length bias,
with a simple method that brings a small but promising improvement. We
conclude with future directions.
Authors' comments: 4 pages, CVPR 2023 Joint Ego4D&EPIC Workshop, Extended Abstract
Hamideh Kerdegari, Tran Huy Nhat Phung, Van Hao Nguyen, Thi Phuong Thao Truong, Ngoc Minh Thu Le, Thanh Phuong Le, Thi Mai Thao Le, Luigi Pisani et al.
Skeletal muscle atrophy is a common occurrence in critically ill patients in
the intensive care unit (ICU) who spend long periods in bed. Muscle mass must
be recovered through physiotherapy before patient discharge and ultrasound
imaging is frequently used to assess the recovery process by measuring the
muscle size over time. However, these manual measurements are subject to large
variability, particularly since the scans are typically acquired on different
days and potentially by different operators. In this paper, we propose a
self-supervised contrastive learning approach to automatically retrieve similar
ultrasound muscle views at different scan times. Three different models were
compared using data from 67 patients acquired in the ICU. Results indicate that
our contrastive model outperformed a supervised baseline model in the task of
view retrieval with an AUC of 73.52% and when combined with an automatic
segmentation model achieved 5.7% ± 0.24% error in cross-sectional area.
Furthermore, a user study survey confirmed the efficacy of our model for muscle
view retrieval.
Authors' comments: 10 pages, 6 figures
Rishikesh Jha, Siddharth Subramaniyam, Ethan Benjamin, Thrivikrama Taula
Embedding-based neural retrieval is a prevalent approach to address the
semantic gap problem which often arises in product search on tail queries. In
contrast, popular queries typically lack context and have a broad intent where
additional context from users' historical interactions can be helpful. In this
paper, we share our novel approach to address both: the semantic gap problem
followed by an end to end trained model for personalized semantic retrieval. We
propose learning a unified embedding model incorporating graph, transformer and
term-based embeddings end to end and share our design choices for optimal
tradeoff between performance and efficiency. We share our learnings in feature
engineering, hard negative sampling strategy, and application of transformer
model, including a novel pre-training strategy and other tricks for improving
search relevance and deploying such a model at industry scale. Our personalized
retrieval model significantly improves the overall search experience, as
measured by a 5.58% increase in search purchase rate and a 2.63% increase in
site-wide conversion rate, aggregated across multiple A/B tests on live
traffic.
Authors' comments: To appear at FMLDS 2024
Yibin Lei, Liang Ding, Yu Cao, Changtong Zan, Andrew Yates, Dacheng Tao
Dense retrievers have achieved impressive performance, but their demand for
abundant training data limits their application scenarios. Contrastive
pre-training, which constructs pseudo-positive examples from unlabeled data,
has shown great potential to solve this problem. However, the pseudo-positive
examples crafted by data augmentations can be irrelevant. To this end, we
propose relevance-aware contrastive learning. It takes the intermediate-trained
model itself as an imperfect oracle to estimate the relevance of positive pairs
and adaptively weighs the contrastive loss of different pairs according to the
estimated relevance. Our method consistently improves the SOTA unsupervised
Contriever model on the BEIR and open-domain QA retrieval benchmarks. Further
exploration shows that our method not only beats BM25 after further
pre-training on the target corpus but also serves as a good few-shot learner.
Our code is publicly available at https://github.com/Yibin-Lei/ReContriever.
Authors' comments: ACL 2023 Findings (Short), 5 pages main + 1 page references + 1 page
appendix
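The weighting scheme in the relevance-aware contrastive learning entry above can be sketched as a weighted InfoNCE loss. The abstract does not say how the relevance estimates are computed or normalized, so this sketch takes them as given non-negative weights and divides by their sum. The function names and the temperature value are assumptions.

```python
import math
from typing import List


def info_nce(sim_row: List[float], positive_index: int, temperature: float = 0.05) -> float:
    """InfoNCE loss for one query: negative log-softmax of the positive's score."""
    logits = [s / temperature for s in sim_row]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_index]


def relevance_weighted_loss(
    sim_matrix: List[List[float]],
    relevance: List[float],
    temperature: float = 0.05,
) -> float:
    """Average the per-pair losses, weighting pair i by its estimated relevance.
    Row i of sim_matrix holds query i's similarities; column i is its positive."""
    losses = [info_nce(row, i, temperature) for i, row in enumerate(sim_matrix)]
    total = sum(relevance)
    if total == 0:
        return 0.0
    return sum(w * l for w, l in zip(relevance, losses)) / total


if __name__ == "__main__":
    sims = [[0.9, 0.1], [0.2, 0.3]]
    # The second pseudo-positive pair looks weak, so it gets a lower weight.
    print(relevance_weighted_loss(sims, relevance=[1.0, 0.2]))
```

Down-weighting a pair whose pseudo-positive looks irrelevant reduces its contribution to the gradient, which is the intuition the abstract describes.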