Xun Jiang, Zailei Zhou, Xing Xu, Yang Yang, Guoqing Wang, Heng Tao Shen
Video Moment Retrieval (VMR) aims at retrieving the most relevant events from
an untrimmed video with natural language queries. Existing VMR methods suffer
from two defects: (1) massive expensive temporal annotations are required to
obtain satisfying performance; (2) complicated cross-modal interaction modules
are deployed, which lead to high computational cost and low efficiency for the
retrieval process. To address these issues, we propose a novel method termed
Cheaper and Faster Moment Retrieval (CFMR), which well balances the retrieval
accuracy, efficiency, and annotation cost for VMR. Specifically, our proposed
CFMR method learns from point-level supervision where each annotation is a
single frame randomly located within the target moment. It is 6 times cheaper
than the conventional annotations of event boundaries. Furthermore, we also
design a concept-based multimodal alignment mechanism to bypass the usage of
cross-modal interaction modules during the inference process, remarkably
improving retrieval efficiency. The experimental results on three widely used
VMR benchmarks demonstrate that the proposed CFMR method establishes a new
state of the art with point-level supervision. Moreover, it significantly
accelerates retrieval, requiring more than 100 times fewer FLOPs than
existing approaches with point-level supervision.
Authors' comments: 10 pages, 7 figures
Mitradeep Sarkar, Michael T. Enders, Mehrdad Shokooh-Saremi, Kenji Watanabe, Takashi Taniguchi, Hanan Herzig Sheinfux, Frank H. L. Koppens, Georgia Theano Papadakis
High-quality low-dimensional layered and van der Waals materials are
typically exfoliated, with sample cross sectional areas on the order of tens to
hundreds of microns. The small size of the flakes makes conventional
spectroscopic ellipsometry unsuitable for characterizing their dielectric
properties, due to beam-sample size mismatch and non-uniformities of the
crystal axes. Previously, the experimental measurement
of the dielectric permittivity of such microcrystals was carried out with
near-field tip-based scanning probes. These measurements are sensitive to
external conditions like vibrations and temperature, and require
non-deterministic numerical fitting to some a priori known model. We present an
alternative method to extract the in-plane dielectric permittivity of van der
Waals microcrystals, based on identifying reflectance minima in spectroscopic
measurements. Our method requires neither complex fitting algorithms nor
near-field tip-based measurements and accommodates small-area samples. We
demonstrate the robustness of our method using hexagonal boron nitride and
$\alpha$-MoO$_3$, and recover dielectric permittivities that are close to
literature values.
Authors' comments: 10 pages, 4 figures and 3 tables
Fanjie Kong, Shuai Yuan, Weituo Hao, Ricardo Henao
We address the challenge of generating fair and unbiased image retrieval results given neutral textual queries (with no explicit gender or race connotations), while maintaining the utility (performance) of the underlying vision-language (VL) model. Previous methods aim to disentangle learned representations of images and text queries from gender and racial characteristics. However, we show these are inadequate at alleviating bias for the desired equal representation result, as there usually exists test-time bias in the target retrieval set. So motivated, we introduce a straightforward technique, Post-hoc Bias Mitigation (PBM), that post-processes the outputs from the pre-trained vision-language model. We evaluate our algorithm on real-world image search datasets, Occupation 1 and 2, as well as two large-scale image-text datasets, MS-COCO and Flickr30k. Our approach achieves the lowest bias, compared with various existing bias-mitigation methods, in text-based image retrieval results while maintaining satisfactory retrieval performance. The source code is publicly available at \url{https://anonymous.4open.science/r/Fair_Text_based_Image_Retrieval-D8B2}.
Kexin Wang, Nils Reimers, Iryna Gurevych
Work on neural retrieval has so far focused on ranking short texts and struggles with long documents. There are many cases where users want to find a relevant passage within a long document from a huge corpus, e.g. Wikipedia articles, research papers, etc. We propose and name this task \emph{Document-Aware Passage Retrieval} (DAPR). While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find that the major errors (53.5\%) are due to missing document context. This drives us to build a benchmark for this task including multiple datasets from heterogeneous domains. In the experiments, we extend the SoTA passage retrievers with document context via (1) hybrid retrieval with BM25 and (2) contextualized passage representations, which inform the passage representation with document context. We find that although hybrid retrieval performs the strongest on the mixture of easy and hard queries, it completely fails on the hard queries that require document-context understanding. On the other hand, contextualized passage representations (e.g. prepending document titles) achieve good improvement on these hard queries, but overall they also perform rather poorly. Our created benchmark enables future research on developing and comparing retrieval systems for the new task. The code and the data are available at https://github.com/UKPLab/arxiv2023-dapr.
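As a rough illustration of the two ways of adding document context mentioned in this abstract, the sketch below fuses BM25 scores with dense-retriever scores and prepends a document title to a passage. The helper names (`prepend_title`, `min_max`, `hybrid_scores`), the separator, the toy scores, and the fusion rule (min-max normalization followed by a weighted sum) are all assumptions made for illustration; the abstract does not say how the scores are combined.

```python
from typing import Dict


def prepend_title(title: str, passage: str, sep: str = " ; ") -> str:
    """Contextualize a passage by prepending its document title (separator is assumed)."""
    return f"{title}{sep}{passage}"


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score map becomes all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_scores(bm25: Dict[str, float], dense: Dict[str, float],
                  weight: float = 0.5) -> Dict[str, float]:
    """Fuse BM25 and dense scores per passage id.

    Assumed rule: convex combination of min-max normalized scores.
    An id missing from one score map counts as 0 for that component.
    """
    b, d = min_max(bm25), min_max(dense)
    keys = set(b) | set(d)
    return {k: weight * b.get(k, 0.0) + (1 - weight) * d.get(k, 0.0) for k in keys}


# Toy scores for three passages (hypothetical numbers).
bm25 = {"p1": 12.0, "p2": 3.0, "p3": 7.5}
dense = {"p1": 0.2, "p2": 0.9, "p3": 0.6}
fused = hybrid_scores(bm25, dense)
ranking = sorted(fused, key=fused.get, reverse=True)
contextualized = prepend_title("Example title", "Example passage text.")
```

In this toy case the passage that is moderately good under both scorers ends up ranked first, which is the kind of behavior a weighted fusion is meant to produce.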
Pha Nguyen, Kha Gia Quach, Kris Kitani, Khoa Luu
One of the recent trends in vision problems is to use natural language
captions to describe the objects of interest. This approach can overcome some
limitations of traditional methods that rely on bounding boxes or category
annotations. This paper introduces a novel paradigm for Multiple Object
Tracking called Type-to-Track, which allows users to track objects in videos by
typing natural language descriptions. We present a new dataset for the
Grounded Multiple Object Tracking task, called GroOT, that contains videos with
various types of objects and their corresponding textual captions describing
their appearance and action in detail. Additionally, we introduce two new
evaluation protocols and formulate evaluation metrics specifically for this
task. We develop a new efficient method that models a transformer-based
eMbed-ENcoDE-extRact framework (MENDER) using the third-order tensor
decomposition. The experiments in five scenarios show that our MENDER approach
outperforms another two-stage design in terms of accuracy and efficiency, by up
to 14.7% in accuracy and 4$\times$ in speed.
Authors' comments: Accepted at NeurIPS 2023. Project page:
https://uark-cviu.github.io/Type-to-Track/
Yucheng Cai, Hong Liu, Zhijian Ou, Yi Huang, Junlan Feng
Most existing task-oriented dialog (TOD) systems track dialog states in terms
of slots and values and use them to query a database to get relevant knowledge
to generate responses. In real-life applications, user utterances are noisier,
and thus it is more difficult to accurately track dialog states and correctly
secure relevant knowledge. A recent advance in question answering and
document-grounded dialog systems is the use of retrieval-augmented methods with
a knowledge retriever. Inspired by such progress, we propose a retrieval-based
method to enhance knowledge selection in TOD systems, which significantly
outperforms the traditional database query method for real-life dialogs.
Further, we develop latent variable model based semi-supervised learning, which
can work with the knowledge retriever to leverage both labeled and unlabeled
dialog data. The Joint Stochastic Approximation (JSA) algorithm is employed for
semi-supervised model training, and the whole system is referred to as
JSA-KRTOD. Experiments are conducted on a real-life dataset from China Mobile
Custom-Service, called MobileCS, and show that JSA-KRTOD achieves superior
performance in both labeled-only and semi-supervised settings.
Authors' comments: 5 pages, accepted by INTERSPEECH 2023
Jinheon Baek, Alham Fikri Aji, Jens Lehmann, Sung Ju Hwang
There has been a surge of interest in utilizing Knowledge Graphs (KGs) for
various natural language processing/understanding tasks. The conventional
mechanism to retrieve facts in KGs usually involves three steps: entity span
detection, entity disambiguation, and relation classification. However, this
approach requires additional labels for training each of the three
subcomponents in addition to pairs of input texts and facts, and also may
accumulate errors propagated from failures in previous steps. To tackle these
limitations, we propose a simple knowledge retrieval framework, which directly
retrieves facts from the KGs given the input text based on their
representational similarities, which we refer to as Direct Fact Retrieval
(DiFaR). Specifically, we first embed all facts in KGs onto a dense embedding
space by using a language model trained on only pairs of input texts and facts,
and then provide the nearest facts in response to the input text. Since the
fact, consisting of only two entities and one relation, has little context to
encode, we propose to further refine ranks of top-k retrieved facts with a
reranker that contextualizes the input text and the fact jointly. We validate
our DiFaR framework on multiple fact retrieval tasks, showing that it
significantly outperforms relevant baselines that use the three-step approach.
Authors' comments: ACL 2023
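The retrieve-then-rerank flow described in this abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' system: the bag-of-words `embed` function stands in for a trained language model, the Jaccard-overlap `rerank` function stands in for a learned reranker, and the three facts are made up.

```python
import numpy as np

# Toy knowledge graph: each fact is a (subject, relation, object) triple.
FACTS = [
    ("Paris", "capital of", "France"),
    ("Berlin", "capital of", "Germany"),
    ("Seine", "flows through", "Paris"),
]


def tokenize(text):
    return text.lower().replace("?", "").split()


# Vocabulary over fact tokens; a trained language model would replace this toy encoder.
VOCAB = sorted({tok for fact in FACTS for tok in tokenize(" ".join(fact))})
INDEX = {tok: i for i, tok in enumerate(VOCAB)}


def embed(text):
    """Toy bag-of-words encoder, L2-normalized (stand-in for a learned dense encoder)."""
    vec = np.zeros(len(VOCAB))
    for tok in tokenize(text):
        if tok in INDEX:
            vec[INDEX[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


# Fact embeddings are computed once, ahead of query time.
FACT_MATRIX = np.stack([embed(" ".join(f)) for f in FACTS])


def retrieve(query, k=2):
    """Return indices of the k facts whose embeddings are most similar to the query."""
    scores = FACT_MATRIX @ embed(query)
    return [int(i) for i in np.argsort(-scores)[:k]]


def rerank(query, candidates):
    """Hypothetical reranker: scores the query and a fact jointly via token overlap."""
    q = set(tokenize(query))

    def joint_score(i):
        f = set(tokenize(" ".join(FACTS[i])))
        return len(q & f) / len(q | f)

    return sorted(candidates, key=joint_score, reverse=True)


query = "What is the capital of France?"
top = rerank(query, retrieve(query, k=2))
best_fact = FACTS[top[0]]
```

The point of the two stages is that the first one only compares precomputed vectors, while the second one looks at the query and each candidate fact together and therefore only runs on the top-k candidates.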
Revanth Gangi Reddy, Pradeep Dasigi, Md Arafat Sultan, Arman Cohan, Avirup Sil, Heng Ji, Hannaneh Hajishirzi
Neural information retrieval often adopts a retrieve-and-rerank framework: a
bi-encoder network first retrieves K (e.g., 100) candidates that are then
re-ranked using a more powerful cross-encoder model to rank the better
candidates higher. The re-ranker generally produces better candidate scores
than the retriever, but is limited to seeing only the top K retrieved
candidates, thus providing no improvements in retrieval performance as measured
by Recall@K. In this work, we leverage the re-ranker to also improve retrieval
by providing inference-time relevance feedback to the retriever. Concretely, we
update the retriever's query representation for a test instance using a
lightweight inference-time distillation of the re-ranker's prediction for that
instance. The distillation loss is designed to bring the retriever's candidate
scores closer to those of the re-ranker. A second retrieval step is then
performed with the updated query vector. We empirically show that our approach,
which can serve arbitrary retrieve-and-rerank pipelines, significantly improves
retrieval recall in multiple domains, languages, and modalities.
Authors' comments: Preprint
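A minimal sketch of the inference-time feedback loop described in this abstract is given below. It assumes dot-product retriever scores, a KL-divergence distillation loss between softmax-normalized re-ranker and retriever scores, and plain gradient descent on the query vector; the abstract does not fix these details, and the random embeddings and re-ranker scores are stand-ins for real model outputs.

```python
import numpy as np


def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def kl(r, p):
    """KL(r || p) for two strictly positive probability vectors."""
    return float(np.sum(r * (np.log(r) - np.log(p))))


def refine_query(q, cand_emb, reranker_scores, lr=0.5, steps=20):
    """Gradient descent on KL(reranker || retriever) with respect to the query vector only."""
    r = softmax(reranker_scores)
    q = q.copy()
    for _ in range(steps):
        p = softmax(cand_emb @ q)
        grad = cand_emb.T @ (p - r)  # gradient of KL(r || softmax(C q)) w.r.t. q
        q -= lr * grad
    return q


rng = np.random.default_rng(0)
corpus = rng.normal(size=(50, 8))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-norm passage embeddings
query = rng.normal(size=8)
query /= np.linalg.norm(query)

K = 5
top_k = np.argsort(-(corpus @ query))[:K]             # first retrieval
reranker_scores = rng.normal(size=K)                  # stand-in for cross-encoder outputs
new_query = refine_query(query, corpus[top_k], reranker_scores)
second_top_k = np.argsort(-(corpus @ new_query))[:K]  # second retrieval, full corpus

kl_before = kl(softmax(reranker_scores), softmax(corpus[top_k] @ query))
kl_after = kl(softmax(reranker_scores), softmax(corpus[top_k] @ new_query))
```

Only the query vector is updated; the passage embeddings and both models stay fixed, which is what keeps the procedure lightweight. After the update, the retriever's distribution over the first-stage candidates is closer to the re-ranker's (`kl_after` is smaller than `kl_before`), and the second retrieval can surface passages that were outside the original top K.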
Ronak Pradeep, Kai Hui, Jai Gupta, Adam D. Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, Vinh Q. Tran
Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first empirical study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. We uncover several findings about scaling generative retrieval to millions of passages; notably, the central importance of using synthetic queries as document representations during indexing, the ineffectiveness of existing proposed architecture modifications when accounting for compute cost, and the limits of naively scaling model parameters with respect to retrieval performance. While we find that generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge. We believe these findings will be valuable for the community to clarify the current state of generative retrieval, highlight the unique challenges, and inspire new research directions.
Jifan Chen, Grace Kim, Aniruddh Sriram, Greg Durrett, Eunsol Choi
Evidence retrieval is a core part of automatic fact-checking. Prior work makes simplifying assumptions in retrieval that depart from real-world use cases: either no access to evidence, access to evidence curated by a human fact-checker, or access to evidence available long after the claim has been made. In this work, we present the first fully automated pipeline to check real-world claims by retrieving raw evidence from the web. We restrict our retriever to only search documents available prior to the claim's making, modeling the realistic scenario where an emerging claim needs to be checked. Our pipeline includes five components: claim decomposition, raw document retrieval, fine-grained evidence retrieval, claim-focused summarization, and veracity judgment. We conduct experiments on complex political claims in the ClaimDecomp dataset and show that the aggregated evidence produced by our pipeline improves veracity judgments. Human evaluation finds the evidence summary produced by our system is reliable (it does not hallucinate information) and relevant to answering key questions about a claim, suggesting that it can assist fact-checkers even when it cannot surface a complete evidence set.
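The restriction to documents available before the claim was made can be pictured as a simple date filter. The `Doc` class, its fields, and the example URLs below are hypothetical; the abstract does not describe how publication dates are obtained or compared.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Doc:
    url: str
    published: date


def available_before(docs: List[Doc], claim_date: date) -> List[Doc]:
    """Keep only documents published strictly before the claim was made."""
    return [d for d in docs if d.published < claim_date]


docs = [
    Doc("https://example.org/early", date(2021, 3, 1)),
    Doc("https://example.org/same-day", date(2021, 6, 15)),
    Doc("https://example.org/late", date(2022, 1, 10)),
]
kept = available_before(docs, claim_date=date(2021, 6, 15))
```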
Bhanu Prakash Voutharoja, Peng Wang, Lei Wang, Vivienne Guan
Image-to-recipe retrieval is a challenging vision-to-language task of
significant practical value. The main challenge of the task lies in the
ultra-high redundancy in the long recipe and the large variation reflected in
both food item combination and food item appearance. A de-facto idea to address
this task is to learn a shared feature embedding space in which a food image is
aligned better to its paired recipe than other recipes. However, such
supervised global matching is prone to supervision collapse, i.e., only partial
information that is necessary for distinguishing training pairs can be
identified, while other information that is potentially useful in
generalization could be lost. To mitigate such a problem, we propose a
mask-augmentation-based local matching network (MALM), where an image-text
matching module and a masked self-distillation module benefit each other
mutually to learn generalizable cross-modality representations. On one hand, we
perform local matching between the tokenized representations of image and text
to locate fine-grained cross-modality correspondence explicitly. We involve
representations of masked image patches in this process to alleviate
overfitting resulting from local matching especially when some food items are
underrepresented. On the other hand, predicting the hidden representations of
the masked patches through self-distillation helps to learn general-purpose
image representations that are expected to generalize better. Moreover, the
multi-task nature of the model enables the representations of masked patches to
be text-aware and thus facilitates reconstruction of the lost information.
Experimental results on the Recipe1M dataset show that our method clearly
outperforms state-of-the-art (SOTA) methods. Our code will be available at
https://github.com/MyFoodChoice/MALM_Mask_Augmentation_based_Local_Matching-_for-_Food_Recipe_Retrieval
Authors' comments: Under review. Link to the dataset repo -
https://github.com/torralba-lab/im2recipe-Pytorch#recipe1m-dataset
Ruiyang Ren, Wayne Xin Zhao, Jing Liu, Hua Wu, Ji-Rong Wen, Haifeng Wang
Recently, model-based retrieval has emerged as a new paradigm in text
retrieval that discards the index in the traditional retrieval model and
instead memorizes the candidate corpora using model parameters. This design
employs a sequence-to-sequence paradigm to generate document identifiers, which
enables the complete capture of the relevance between queries and documents and
simplifies the classic index-retrieval-rerank pipeline. Despite its attractive
qualities, there remain several major challenges in model-based retrieval,
including the discrepancy between pre-training and fine-tuning, and the
discrepancy between training and inference. To deal with the above challenges,
we propose a novel two-stage model-based retrieval approach called TOME, which
makes two major technical contributions, including the utilization of tokenized
URLs as identifiers and the design of a two-stage generation architecture. We
also propose a number of training strategies to deal with the training
difficulty as the corpus size increases. Extensive experiments and analysis on
MS MARCO and Natural Questions demonstrate the effectiveness of our proposed
approach, and we investigate the scaling laws of TOME by examining various
influencing factors.
Authors' comments: ACL 2023
Samuel Talkington, Santiago Grijalva
Phase retrieval is a prevalent problem in digital signal processing and
experimental physics that consists of estimating a complex signal from
magnitude measurements. This paper expands the classical phase retrieval
framework to electric power systems with unknown network models and limited
access to observations of voltage magnitudes, active power injections, and
reactive power injections. The proposed method recovers the phase angles and
the power-phase angle submatrices of the AC power flow Jacobian matrix. This is
made possible by deriving topology and parameter-free expressions for the
structural symmetries of the power flow Jacobian that do not depend on the
phase angles. These physical laws provide structural constraints for the
proposed phase retrieval method. The paper then presents sufficient conditions
for guaranteed recovery of the voltage phase angles, which also depend solely
on voltage magnitudes, active power injections, and reactive power injections.
The method offers two significant benefits: it estimates the voltage phase
angles and recovers the power flow Jacobian matrix, a basis for approximating
the power flow equations. Simulations on widely studied open-source test
networks validate the findings.
Authors' comments: The 14th ACM International Conference on Future Energy Systems (ACM
e-Energy 2023)
Han Fang, Zhifei Yang, Xianghao Zang, Chao Ban, Hao Sun
Recently, masked video modeling has been widely explored and has significantly improved models' understanding of visual regions at a local level. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which does not leverage the correlations between cross-modal content. In this paper, we present Mask for Semantics Completion (MASCOT) based on semantic-based masked modeling. Specifically, after applying attention-based video masking to generate high-informed and low-informed masks, we propose Informed Semantics Completion to recover masked semantics information. The recovery mechanism is achieved by aligning the masked content with the unmasked visual regions and corresponding textual context, which makes the model capture more text-related details at a patch level. Additionally, we shift the emphasis of reconstruction from irrelevant backgrounds to discriminative parts to ignore regions with low-informed masks. Furthermore, we design dual-mask co-learning to incorporate video cues under different masks and learn more aligned video representation. Our MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks, including MSR-VTT, LSMDC, ActivityNet, and DiDeMo. Extensive ablation studies demonstrate the effectiveness of the proposed schemes.
Zongyu Li, Jason Hu, Xiaojian Xu, Liyue Shen, Jeffrey A. Fessler
Phase retrieval (PR) is a crucial problem in many imaging applications. This study focuses on resolving the holographic phase retrieval problem in situations where the measurements are affected by a combination of Poisson and Gaussian noise, which commonly occurs in optical imaging systems. To address this problem, we propose a new algorithm called "AWFS" that uses the accelerated Wirtinger flow (AWF) with a score function as a generative prior. Specifically, we formulate the PR problem as an optimization problem that incorporates both data fidelity and regularization terms. We calculate the gradient of the log-likelihood function for PR and determine its corresponding Lipschitz constant. Additionally, we introduce a generative prior in our regularization framework by using score matching to capture information about the gradient of image prior distributions. We provide theoretical analysis that establishes a critical-point convergence guarantee for the proposed algorithm. The results of our simulation experiments on three different datasets show the following: 1) By using the PG likelihood model, the proposed algorithm improves reconstruction compared to algorithms based solely on Gaussian or Poisson likelihood. 2) The proposed score-based image prior method performs better than the method based on the denoising diffusion probabilistic model (DDPM), as well as plug-and-play alternating direction method of multipliers (PnP-ADMM) and regularization by denoising (RED).
Sarah A. Obead, Hsuan-Yin Lin, Eirik Rosnes
We study the problem of pliable private information retrieval with side
information (PPIR-SI) for the single server case. In PPIR, the messages are
partitioned into nonoverlapping classes and stored in a number of noncolluding
databases. The user wishes to retrieve any one message from a desired class
while revealing no information about the desired class identity to the
databases. In PPIR-SI, the user has prior access to some side information in
the form of messages from different classes and wishes to retrieve any one new
message from a desired class, i.e., the message is not included in the side
information set, while revealing no information about the desired class to the
databases. We characterize the capacity of (linear) single-server PPIR-SI for
the case where the user's side information is unidentified, i.e., the user is
oblivious of the identities of its side information messages and the database
structure. We term this case PPIR-USI. Surprisingly, we show that having side
information, in PPIR-USI, is disadvantageous, in terms of the download rate,
compared to PPIR.
Authors' comments: 9 pages, 3 figures, 1 table. An extended version of a paper accepted
for presentation at the 2023 IEEE International Symposium on Information
Theory (ISIT)
Odunayo Ogundepo, Tajuddeen R. Gwadabe, Clara E. Rivera, Jonathan H. Clark, Sebastian Ruder, David Ifeoluwa Adelani, Bonaventure F. P. Dossou, Abdou Aziz DIOP et al.
African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.
Richard Luo, Austin Peng, Heidi Yap, Koby Beard
Video summarization has become an increasingly important task in the field of computer vision due to the vast amount of video content available on the internet. In this project, we propose a new method for natural language query based joint video summarization and highlight detection using multi-modal transformers. This approach will use both visual and audio cues to match a user's natural language query to retrieve the most relevant and interesting moments from a video. Our approach employs multiple recent techniques used in Vision Transformers (ViTs) to create a transformer-like encoder-decoder model. We evaluated our approach on multiple datasets such as YouTube Highlights and TVSum to demonstrate the flexibility of our proposed method.
Rikimaru Kurata, Keiichiro Toda, Genki Ishigane, Makoto Naruse, Ryoichi Horisaki, Takuro Ideguchi
We present a single-image numerical phase retrieval method for Zernike phase-contrast microscopy (ZPM) that addresses halo and shade-off artifacts, as well as the weak phase condition, without requiring hardware modifications. By employing a rigorous physical model of ZPM and a gradient descent algorithm for its inversion, we achieve quantitative ZPM imaging. Our approach is experimentally validated using biological cells and its quantitative nature is confirmed through comparisons with digital holography observations.
Sheng Yan, Yang Liu, Haoqiang Wang, Xin Du, Mengyuan Liu, Hong Liu
Cross-modal retrieval of image-text and video-text is a prominent research
area in computer vision and natural language processing. However, there has
been insufficient attention given to cross-modal retrieval between human motion
and text, despite its wide-ranging applicability. To address this gap, we
utilize a concise yet effective dual-unimodal transformer encoder for tackling
this task. Recognizing that overlapping atomic actions in different human
motion sequences can lead to semantic conflicts between samples, we explore a
novel triplet loss function called DropTriple Loss. This loss function discards
false negative samples from the negative sample set and focuses on mining
remaining genuinely hard negative samples for triplet training, thereby
reducing violations they cause. We evaluate our model and approach on the
HumanML3D and KIT Motion-Language datasets. On the latest HumanML3D dataset, we
achieve a recall of 62.9% for motion retrieval and 71.5% for text retrieval
(both based on R@10). The source code for our approach is publicly available at
https://github.com/eanson023/rehamot.
Authors' comments: This paper is accepted by ACM MM Asia 2023
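The idea of dropping false negatives before mining hard negatives can be sketched as a masked triplet loss. The function below is an illustration under assumptions, not the published DropTriple Loss: it takes the false-negative mask as an input because the abstract does not say how false negatives are identified, and the margin value and similarity matrix are made up.

```python
import numpy as np


def drop_triplet_loss(sim, false_neg_mask, margin=0.2):
    """Hinge triplet loss over hardest negatives, skipping flagged false negatives.

    sim[i, j]: similarity between anchor i and candidate j; sim[i, i] is the positive pair.
    false_neg_mask[i, j]: True if candidate j is treated as a false negative for anchor i.
    """
    n = sim.shape[0]
    pos = np.diag(sim)
    excluded = false_neg_mask | np.eye(n, dtype=bool)  # never use the positive as a negative
    masked = np.where(excluded, -np.inf, sim)
    hardest = masked.max(axis=1)                       # -inf if every negative was dropped
    losses = np.maximum(0.0, margin + hardest - pos)   # rows with no negatives contribute 0
    return float(losses.mean())


# Toy motion-to-text similarities; candidate 1 describes nearly the same action as anchor 0.
sim = np.array([
    [0.90, 0.85, 0.10],
    [0.30, 0.80, 0.20],
    [0.10, 0.40, 0.70],
])
no_drop = np.zeros((3, 3), dtype=bool)
drop = no_drop.copy()
drop[0, 1] = True  # flag (anchor 0, candidate 1) as a false negative

loss_plain = drop_triplet_loss(sim, no_drop)
loss_dropped = drop_triplet_loss(sim, drop)
```

Without the mask, anchor 0 is penalized for being close to a candidate that is semantically almost a match; with the mask, that pair is ignored and the loss is computed from the remaining negatives only.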