Zhe Li, Weihao Yuan, Yisheng He, Lingteng Qiu, Shenhao Zhu, Xiaodong Gu, Weichao Shen, Yuan Dong et al.
Language plays a vital role in the realm of human motion. Existing methods have largely depended on CLIP text embeddings for motion generation, yet they fall short in effectively aligning language and motion due to CLIP's pretraining on static image-text pairs. This work introduces LaMP, a novel Language-Motion Pretraining model, which transitions from a language-vision to a more suitable language-motion latent space. It addresses key limitations by generating motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences. With LaMP, we advance three key tasks: text-to-motion generation, motion-text retrieval, and motion captioning through aligned language-motion representation learning. For generation, we utilize LaMP to provide the text condition instead of CLIP, and an autoregressive masked prediction is designed to achieve mask modeling without rank collapse in transformers. For retrieval, motion features from LaMP's motion transformer interact with query tokens to retrieve text features from the text transformer, and vice versa. For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model. In addition, we introduce the LaMP-BertScore metric to assess the alignment of generated motions with textual descriptions. Extensive experimental results on multiple datasets demonstrate substantial improvements over previous methods across all three tasks. The code of our method will be made public.
Mohammad Omama, Po-han Li, Sandeep P. Chinchali
Image retrieval is crucial in robotics and computer vision, with downstream applications in robot place recognition and vision-based product recommendations. Modern retrieval systems face two key challenges: scalability and efficiency. State-of-the-art image retrieval systems train specific neural networks for each dataset, an approach that lacks scalability. Furthermore, since retrieval speed is directly proportional to embedding size, existing systems that use large embeddings lack efficiency. To tackle scalability, recent works propose using off-the-shelf foundation models. However, these models, though applicable across datasets, fall short in achieving performance comparable to that of dataset-specific models. Our key observation is that, while foundation models capture necessary subtleties for effective retrieval, the underlying distribution of their embedding space can negatively impact cosine similarity searches. We introduce Autoencoders with Strong Variance Constraints (AE-SVC), which, when used for projection, significantly improves the performance of foundation models. We provide an in-depth theoretical analysis of AE-SVC. Addressing efficiency, we introduce Single-shot Similarity Space Distillation ((SS)$_2$D), a novel approach to learn embeddings with adaptive sizes that offers a better trade-off between size and performance. We conducted extensive experiments on four retrieval datasets, including Stanford Online Products (SoP) and Pittsburgh30k, using four different off-the-shelf foundation models, including DinoV2 and CLIP. AE-SVC demonstrates up to a $16\%$ improvement in retrieval performance, while (SS)$_2$D shows a further $10\%$ improvement for smaller embedding sizes.
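The cosine similarity search that the abstract refers to can be illustrated with a minimal sketch. This is not the AE-SVC or (SS)$_2$D method itself; the helper names `cosine_similarity` and `top_k` and the toy three-dimensional embeddings are assumptions made for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, gallery, k):
    """Return the indices of the k gallery embeddings most similar to the query."""
    scores = [(cosine_similarity(query, emb), idx) for idx, emb in enumerate(gallery)]
    # Highest similarity first; ties are broken by the lower index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:k]]

# Toy embeddings (hypothetical values, not taken from any dataset).
gallery = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]
print(top_k(query, gallery, 2))
```

Each comparison touches every coordinate of both vectors, so the cost of a brute-force search grows with the embedding size, which is the efficiency concern the abstract raises.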
Thomas Schmied, Fabian Paischer, Vihang Patil, Markus Hofmarcher, Razvan Pascanu, Sepp Hochreiter
In-context learning (ICL) is the ability of a model to learn a new task by observing a few exemplars in its context. While prevalent in NLP, this capability has recently also been observed in Reinforcement Learning (RL) settings. Prior in-context RL methods, however, require entire episodes in the agent's context. Given that complex environments typically lead to long episodes with sparse rewards, these methods are constrained to simple environments with short episodes. To address these challenges, we introduce Retrieval-Augmented Decision Transformer (RA-DT). RA-DT employs an external memory mechanism to store past experiences from which it retrieves only sub-trajectories relevant for the current situation. The retrieval component in RA-DT does not require training and can be entirely domain-agnostic. We evaluate the capabilities of RA-DT on grid-world environments, robotics simulations, and procedurally-generated video games. On grid-worlds, RA-DT outperforms baselines, while using only a fraction of their context length. Furthermore, we illuminate the limitations of current in-context RL methods on complex environments and discuss future directions. To facilitate future research, we release datasets for four of the considered environments.
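The idea of storing past experience and retrieving only relevant sub-trajectories can be sketched roughly as follows. This is a toy illustration, not the RA-DT implementation: the class `SubTrajectoryMemory`, the fixed chunk length, and the choice to key each chunk by its first raw state with squared Euclidean distance are all assumptions, since the abstract does not specify how retrieval keys are built.

```python
def split_into_subtrajectories(episode, length):
    """Cut an episode (a list of (state, action, reward) steps) into chunks."""
    return [episode[i:i + length] for i in range(0, len(episode), length)]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length state vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class SubTrajectoryMemory:
    """Hypothetical external memory that stores fixed-length sub-trajectories."""

    def __init__(self, chunk_length):
        self.chunk_length = chunk_length
        self.chunks = []

    def add_episode(self, episode):
        self.chunks.extend(split_into_subtrajectories(episode, self.chunk_length))

    def retrieve(self, current_state, k):
        """Return the k chunks whose first state is closest to current_state."""
        ranked = sorted(
            self.chunks,
            key=lambda chunk: squared_distance(chunk[0][0], current_state),
        )
        return ranked[:k]

# One toy episode: the agent walks along the x-axis and is rewarded at the end.
episode = [((float(x), 0.0), "right", 1.0 if x == 5 else 0.0) for x in range(6)]
memory = SubTrajectoryMemory(chunk_length=2)
memory.add_episode(episode)

# Only the chunk nearest to the current state enters the context.
context = memory.retrieve(current_state=(3.9, 0.0), k=1)
print(context)
```

Because only `k` short chunks are returned, the context length stays bounded regardless of how long the stored episodes are.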
Jakub Pokrywka, Piotr Wierzchoń, Kornel Weryszko, Krzysztof Jassem
Multimodal models, which combine visual and textual information, have recently gained significant recognition. This paper addresses the multimodal challenge of Text-Image retrieval and introduces a novel task that extends the modalities to include temporal data. The Temporal Image Caption Retrieval Competition (TICRC) presented in this paper is based on the Chronicling America and Challenging America projects, which offer access to an extensive collection of digitized historic American newspapers spanning 274 years. In addition to the competition results, we provide an analysis of the delivered dataset and the process of its creation.
Linping Zhang, Yu Liu, Xueqian Wang, Gang Li, You He
With the improvement in the quantity and quality of remote sensing images, content-based remote sensing object retrieval (CBRSOR) has become an increasingly important topic. However, existing CBRSOR methods neglect the utilization of global statistical information during both training and test stages, which leads to the overfitting of neural networks to simple sample pairs during training and to suboptimal metric performance. Inspired by the Neyman-Pearson theorem, we propose a generalized likelihood ratio test-based metric learning (GLRTML) approach, which can estimate the relative difficulty of sample pairs by incorporating global data distribution information during training and test phases. This guides the network to focus more on difficult samples during the training process, thereby encouraging the network to learn more discriminative feature embeddings. In addition, GLRT is more effective than traditional metric spaces due to the utilization of global data distribution information. Accurately estimating the distribution of embeddings is critical for GLRTML. However, in real-world applications, there is often a distribution shift between the training and target domains, which diminishes the effectiveness of directly using the distribution estimated on training data. To address this issue, we propose the clustering pseudo-labels-based fast parameter adaptation (CPLFPA) method. CPLFPA efficiently estimates the distribution of embeddings in the target domain by clustering target domain instances and re-estimating the distribution parameters for GLRTML. We reorganize datasets for CBRSOR tasks based on fine-grained ship remote sensing image slices (FGSRSI-23) and military aircraft recognition (MAR20) datasets. Extensive experiments on these datasets demonstrate the effectiveness of our proposed GLRTML and CPLFPA.
Hang Guo, Tao Dai, Zhihao Ouyang, Taolin Zhang, Yaohua Zha, Bin Chen, Shu-tao Xia
Recent advances in diffusion-based Large Restoration Models (LRMs) have
significantly improved photo-realistic image restoration by leveraging the
internal knowledge embedded within model weights. However, existing LRMs often
suffer from the hallucination dilemma, i.e., producing incorrect contents or
textures when dealing with severe degradations, due to their heavy reliance on
limited internal knowledge. In this paper, we propose an orthogonal solution
called the Retrieval-augmented Framework for Image Restoration (ReFIR), which
incorporates retrieved images as external knowledge to extend the knowledge
boundary of existing LRMs in generating details faithful to the original scene.
Specifically, we first introduce the nearest neighbor lookup to retrieve
content-relevant high-quality images as reference, after which we propose the
cross-image injection to modify existing LRMs to utilize high-quality textures
from retrieved images. Thanks to the additional external knowledge, our ReFIR
can handle the hallucination challenge well and facilitate faithful results.
Extensive experiments demonstrate that ReFIR can achieve not only high-fidelity
but also realistic restoration results. Importantly, our ReFIR requires no
training and is adaptable to various LRMs.
Authors' comments: Accepted by NeurIPS 2024
Rima Alaifari, Yunan Yang
Phase retrieval from phaseless short-time Fourier transform (STFT)
measurements is known to be inherently unstable when measurements are taken
with respect to a single window. While an explicit inversion formula exists, it
is useless in practice due to its instability. In this paper, we overcome this
lack of stability by presenting two multi-window approaches that rely on a
"good coverage" of the time-frequency plane by the ambiguity functions of the
windows. The first is to use the fractional Fourier transform of a dilated
Gauss function with various angles as window functions. The essential support
of a superposition of the ambiguity function from such window functions is of a
"daffodil shape", which converges to a large disc as more angles are used,
yielding a much broader coverage in the time-frequency domain. The second
approach uses Hermite functions of various degrees as the window functions. The
larger the degree, the wider the ambiguity function but with zeros on circles
in the time-frequency domain. Combining Hermite functions of different degrees,
we can achieve a wide coverage with zeros compensated by the essential support
of the ambiguity function from other Hermite windows. Taking advantage of these
multi-window procedures, we can stably perform STFT phase retrieval using the
direct inversion formula.
Authors' comments: 20 pages, 11 figures
Pengcheng Jiang, Cao Xiao, Minhao Jiang, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, Jiawei Han
Large language models (LLMs) have demonstrated significant potential in
clinical decision support. Yet LLMs still suffer from hallucinations and lack
fine-grained contextual medical knowledge, limiting their use in high-stakes
healthcare applications such as clinical diagnosis. Traditional retrieval-augmented
generation (RAG) methods attempt to address these limitations but frequently
retrieve sparse or irrelevant information, undermining prediction accuracy. We
introduce KARE, a novel framework that integrates knowledge graph (KG)
community-level retrieval with LLM reasoning to enhance healthcare predictions.
KARE constructs a comprehensive multi-source KG by integrating biomedical
databases, clinical literature, and LLM-generated insights, and organizes it
using hierarchical graph community detection and summarization for precise and
contextually relevant information retrieval. Our key innovations include: (1) a
dense medical knowledge structuring approach enabling accurate retrieval of
relevant information; (2) a dynamic knowledge retrieval mechanism that enriches
patient contexts with focused, multi-faceted medical insights; and (3) a
reasoning-enhanced prediction framework that leverages these enriched contexts
to produce both accurate and interpretable clinical predictions. Extensive
experiments demonstrate that KARE outperforms leading models by up to
10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and
readmission predictions. In addition to its impressive prediction accuracy, our
framework leverages the reasoning capabilities of LLMs, enhancing the
trustworthiness of clinical predictions.
Authors' comments: under review
Aniruddh Sriram, Fangyuan Xu, Eunsol Choi, Greg Durrett
Recent work on fact-checking addresses a realistic setting where models
incorporate evidence retrieved from the web to decide the veracity of claims. A
bottleneck in this pipeline is in retrieving relevant evidence: traditional
methods may surface documents directly related to a claim, but fact-checking
complex claims requires more inferences. For instance, a document about how a
vaccine was developed is relevant to addressing claims about what it might
contain, even if it does not address them directly. We present Contrastive
Fact-Checking Reranker (CFR), an improved retriever for this setting. By
leveraging the AVeriTeC dataset, which annotates subquestions for claims with
human written answers from evidence documents, we fine-tune Contriever with a
contrastive objective based on multiple training signals, including
distillation from GPT-4, evaluating subquestion answers, and gold labels in the
dataset. We evaluate our model on both retrieval and end-to-end veracity
judgments about claims. On the AVeriTeC dataset, we find a 6% improvement in
veracity classification accuracy. We also show our gains can be transferred to
FEVER, ClaimDecomp, HotpotQA, and a synthetic dataset requiring retrievers to
make inferences.
Authors' comments: EMNLP 2024 FEVER Workshop
Licheng Dai, Xiliang Lu, Juntao You
This paper investigates the sparse phase retrieval problem, which aims to recover a sparse signal from a system of quadratic measurements. In this work, we propose a novel non-convex algorithm, termed Gradient Hard Thresholding Pursuit (GraHTP), for sparse phase retrieval with complex sensing vectors. GraHTP is theoretically provable and exhibits high efficiency, achieving a quadratic convergence rate after a finite number of iterations, while maintaining low computational complexity per iteration. Numerical experiments further demonstrate GraHTP's superior performance compared to state-of-the-art algorithms.
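Two ingredients named in the abstract, quadratic measurements with complex sensing vectors and hard thresholding, can be sketched as follows. This is not the GraHTP algorithm: the function names, the toy signal, and the sensing vectors are assumptions, and the thresholding shown is the standard keep-the-`s`-largest operator rather than anything confirmed by the abstract.

```python
import math

def quadratic_measurements(sensing_vectors, x):
    """Phaseless measurements y_j = |<a_j, x>|^2 for complex sensing vectors a_j."""
    return [
        abs(sum(a_i.conjugate() * x_i for a_i, x_i in zip(a, x))) ** 2
        for a in sensing_vectors
    ]

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and set the rest to zero."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:s])
    return [x[i] if i in keep else 0 for i in range(len(x))]

# Toy 2-sparse complex signal and sensing vectors (hypothetical values).
signal = [0j, 1 + 2j, 0j, -1j]
sensing = [
    [1 + 0j, 1j, 2 + 0j, 1 - 1j],
    [0j, 2 + 1j, 1 + 0j, 1j],
    [1 - 1j, 0j, 1j, 3 + 0j],
]

y = quadratic_measurements(sensing, signal)

# Multiplying the signal by a unit-modulus constant leaves y unchanged, so the
# signal can only be recovered up to a global phase.
rotated = [1j * v for v in signal]
y_rotated = quadratic_measurements(sensing, rotated)
print(all(math.isclose(a, b) for a, b in zip(y, y_rotated)))

# Projecting a dense estimate back onto 2-sparse vectors.
print(hard_threshold([0.1, -3.0, 2.0, 0.5], 2))
```

The global-phase check shows why recovery is only ever meaningful up to a unit-modulus factor, and the thresholding step is what enforces the sparsity constraint on an iterate.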
Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang et al.
The scaling of inference computation has unlocked the potential of
long-context large language models (LLMs) across diverse settings. For
knowledge-intensive tasks, the increased compute is often allocated to
incorporate more external knowledge. However, without effectively utilizing
such knowledge, solely expanding context does not always enhance performance.
In this work, we investigate inference scaling for retrieval augmented
generation (RAG), exploring the combination of multiple strategies beyond
simply increasing the quantity of knowledge, including in-context learning and
iterative prompting. These strategies provide additional flexibility to scale
test-time computation (e.g., by increasing retrieved documents or generation
steps), thereby enhancing LLMs' ability to effectively acquire and utilize
contextual information. We address two key questions: (1) How does RAG
performance benefit from the scaling of inference computation when optimally
configured? (2) Can we predict the optimal test-time compute allocation for a
given budget by modeling the relationship between RAG performance and inference
parameters? Our observations reveal that increasing inference computation leads
to nearly linear gains in RAG performance when optimally allocated, a
relationship we describe as the inference scaling laws for RAG. Building on
this, we further develop the computation allocation model to estimate RAG
performance across different inference configurations. The model predicts
optimal inference parameters under various computation constraints, which align
closely with the experimental results. By applying these optimal
configurations, we demonstrate that scaling inference compute on long-context
LLMs achieves up to 58.9% gains on benchmark datasets compared to standard RAG.
Authors' comments: ICLR 2025
Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, Haonan Li
As large language models (LLMs) advance, their inability to autonomously
execute tasks by directly interacting with external tools remains a critical
limitation. Traditional methods rely on inputting tool descriptions as context,
which is constrained by context length and requires separate, often
inefficient, retrieval mechanisms. We introduce ToolGen, a paradigm shift that
integrates tool knowledge directly into the LLM's parameters by representing
each tool as a unique token. This enables the LLM to generate tool calls and
arguments as part of its next token prediction capabilities, seamlessly
blending tool invocation with language generation. Our framework allows the LLM
to access and utilize a vast number of tools with no additional retrieval step,
significantly enhancing both performance and scalability. Experimental results
with over 47,000 tools show that ToolGen not only achieves superior results in
both tool retrieval and autonomous task completion but also sets the stage for
a new era of AI agents that can adapt to tools across diverse domains. By
fundamentally transforming tool retrieval into a generative process, ToolGen
paves the way for more versatile, efficient, and autonomous AI systems. ToolGen
enables end-to-end tool learning and opens opportunities for integration with
other advanced techniques such as chain-of-thought and reinforcement learning,
thereby expanding the practical capabilities of LLMs.
Authors' comments: ICLR 2025
Jaewoo Lee, Joonho Ko, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang
Information Retrieval (IR) methods, which aim to identify relevant documents in
response to a given query, have gained remarkable attention due to their
successful application in various natural language tasks. However, existing
approaches typically consider only the textual information within the
documents, which overlooks the fact that documents can contain multiple
modalities, including texts, images, and tables. Further, they often segment
each long document into multiple discrete passages for embedding, preventing
them from capturing the overall document context and interactions between
paragraphs. We argue that these two limitations lead to suboptimal document
representations for retrieval. In this work, to address them, we aim to produce
more comprehensive and nuanced document representations by holistically
embedding documents interleaved with different modalities. Specifically, we
achieve this by leveraging the capability of recent vision-language models that
enable the processing and integration of text, images, and tables into a
unified format and representation. Moreover, to mitigate the information loss
from segmenting documents into passages, instead of representing and retrieving
passages individually, we further merge the representations of segmented
passages into one single document representation, while we additionally
introduce a reranking strategy to decouple and identify the relevant passage
within the document if necessary. Then, through extensive experiments on
diverse information retrieval scenarios considering both the textual and
multimodal queries, we show that our approach substantially outperforms
relevant baselines, thanks to the consideration of the multimodal information
interleaved within the documents in a unified way.
Authors' comments: Preprint
Charbel Chucri, Rami Azouz, Joachim Ott
Recent retrieval-augmented models enhance basic methods by building a hierarchical structure over retrieved text chunks through recursive embedding, clustering, and summarization. The most relevant information is then retrieved from both the original text and generated summaries. However, such approaches face limitations with dynamic datasets, where adding or removing documents over time complicates the updating of hierarchical representations formed through clustering. We propose a new algorithm to efficiently maintain the recursive-abstractive tree structure in dynamic datasets, without compromising performance. Additionally, we introduce a novel post-retrieval method that applies query-focused recursive abstractive processing to substantially improve context quality. Our method overcomes the limitations of other approaches by functioning as a black-box post-retrieval layer compatible with any retrieval algorithm. Both algorithms are validated through extensive experiments on real-world datasets, demonstrating their effectiveness in handling dynamic data and improving retrieval performance.
Aleksandr Gordeev, Vladimir Dokholyan, Irina Tolstykh, Maksim Kuprashevich
Existing approaches for video moment retrieval and highlight detection are
not able to align text and video features efficiently, resulting in
unsatisfactory performance and limited production usage. To address this, we
propose a novel architecture that utilizes recent foundational video models
designed for such alignment. Combined with the introduced Saliency-Guided Cross
Attention mechanism and a hybrid DETR architecture, our approach significantly
enhances performance in both moment retrieval and highlight detection tasks.
To improve performance further, we developed InterVid-MR, a large-scale and
high-quality dataset for pretraining. Using it, our architecture achieves
state-of-the-art results on the QVHighlights, Charades-STA and TACoS
benchmarks. The proposed approach provides an efficient and scalable solution
for both zero-shot and fine-tuning scenarios in video-language tasks.
Authors' comments: 8 pages, 1 figure, 4 tables
Francesc Net, Lluis Gomez
The intersection of Artificial Intelligence and Digital Humanities enables
researchers to explore cultural heritage collections with greater depth and
scale. In this paper, we present EUFCC-CIR, a dataset designed for Composed
Image Retrieval (CIR) within Galleries, Libraries, Archives, and Museums (GLAM)
collections. Our dataset is built on top of the EUFCC-340K image labeling
dataset and contains over 180K annotated CIR triplets. Each triplet is composed
of a multi-modal query (an input image plus a short text describing the desired
attribute manipulations) and a set of relevant target images. The EUFCC-CIR
dataset fills an existing gap in CIR-specific resources for Digital Humanities.
We demonstrate the value of the EUFCC-CIR dataset by highlighting its unique
qualities in comparison to other existing CIR datasets and evaluating the
performance of several zero-shot CIR baselines.
Authors' comments: ECCV Workshop (AI4DH2024)
Keyush Shah, Abhishek Goyal, Isaac Wasserman
Retrieval augmented generation (RAG) has become the standard in long context question answering (QA) systems. However, typical implementations of RAG rely on a rather naive retrieval mechanism, in which texts whose embeddings are most similar to that of the query are deemed most relevant. This has consequences in subjective QA tasks, where the most relevant text may not directly contain the answer. In this work, we propose a novel extension to RAG systems, which we call Retrieval from AI Derived Documents (RAIDD). RAIDD leverages the full power of the LLM in the retrieval process by deriving inferred features, such as summaries and example questions, from the documents at ingest. We demonstrate that this approach significantly improves the performance of RAG systems on long-context QA tasks.
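The ingest-time derivation described in the abstract can be sketched as follows. This is a toy illustration rather than the RAIDD implementation: a hand-written lookup table (`canned`) stands in for the LLM that would derive summaries and example questions, bag-of-words cosine similarity stands in for learned embeddings, and all function names are assumptions.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Very rough stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def build_index(documents, derive):
    """Index derived texts; each entry points back to its source document id."""
    index = []
    for doc_id, text in documents.items():
        for derived in derive(doc_id, text):
            index.append((bag_of_words(derived), doc_id))
    return index

def retrieve(index, query, k=1):
    """Rank derived texts against the query and return distinct source ids."""
    q = bag_of_words(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)
    results = []
    for _, doc_id in ranked:
        if doc_id not in results:
            results.append(doc_id)
        if len(results) == k:
            break
    return results

documents = {
    "memo": "Q3 revenue fell 12 percent while churn rose.",
    "faq": "Reset your password from the account settings page.",
}

# Hypothetical derived features; in practice an LLM would generate these at ingest.
canned = {
    "memo": ["summary: the company had a weak quarter",
             "question: how is the business doing financially"],
    "faq": ["summary: instructions for password reset",
            "question: how do i change my login credentials"],
}

def toy_derive(doc_id, text):
    return canned[doc_id]

index = build_index(documents, toy_derive)
query = "how is the business doing"
print(retrieve(index, query))
print(cosine(bag_of_words(query), bag_of_words(documents["memo"])))
```

In this toy case the query shares no words with the raw memo text, so direct similarity to the document is zero, yet the derived example question routes the query to the right source.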
Guangsheng Ma, Hongbo Li
Quantum computation has found greater efficiency and security across various fields. We show that, in a near-term hybrid cloud computing scenario with only a single quantum server and an entirely classical client, critical bottlenecks in privacy-preserving computation can be addressed. First, we propose an efficient quantum functional bootstrapping algorithm with a runtime polynomial in the plaintext size, providing an exponential quantum speedup over classical algorithms. Second, we present a secure and fast quantum private information retrieval (PIR) protocol with logarithmic query time. The security relies on the learning with errors (LWE) problem with polynomial modulus, greatly improving on the security of the classical fast PIR protocol based on ring-LWE with super-polynomial modulus. Technically, we extend an important classical homomorphic operation, known as blind rotation, to the quantum case by an encrypted conditional rotation technique. This technique holds promise for broader applications in quantum cryptography.
Yabing Wang, Le Wang, Qiang Zhou, Zhibin Wang, Hao Li, Gang Hua, Wei Tang
Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant
content based on non-English queries, without relying on human-labeled
cross-modal data pairs during training. One popular approach involves utilizing
machine translation (MT) to create pseudo-parallel data pairs, establishing
correspondence between visual and non-English textual data. However, aligning
their representations poses challenges due to the significant semantic gap
between vision and text, as well as the lower quality of non-English
representations caused by pre-trained encoders and data noise. To overcome
these challenges, we propose LECCR, a novel solution that incorporates the
multi-modal large language model (MLLM) to improve the alignment between visual
and non-English representations. Specifically, we first employ MLLM to generate
detailed visual content descriptions and aggregate them into multi-view
semantic slots that encapsulate different semantics. Then, we take these
semantic slots as internal features and leverage them to interact with the
visual features. By doing so, we enhance the semantic information within the
visual features, narrowing the semantic gap between modalities and generating
local visual semantics for subsequent multi-level matching. Additionally, to
further enhance the alignment between visual and non-English features, we
introduce softened matching under English guidance. This approach provides more
comprehensive and reliable inter-modal correspondences between visual and
non-English features. Extensive experiments on four CCR benchmarks, i.e.,
Multi30K, MSCOCO, VATEX, and MSR-VTT-CN, demonstrate the effectiveness of our
proposed method. Code: https://github.com/LiJiaBei-7/leccr.
Authors' comments: Accepted by ACM Multimedia
Bingqing Zhang, Zhuo Cao, Heming Du, Xin Yu, Xue Li, Jiajun Liu, Sen Wang
Text-Video Retrieval (TVR) methods typically match query-candidate pairs by aligning text and video features in coarse-grained, fine-grained, or combined (coarse-to-fine) manners. However, these frameworks predominantly employ a one(query)-to-one(candidate) alignment paradigm, which struggles to discern nuanced differences among candidates, leading to frequent mismatches. Inspired by Comparative Judgement in human cognitive science, where decisions are made by directly comparing items rather than evaluating them independently, we propose TokenBinder. This innovative two-stage TVR framework introduces a novel one-to-many coarse-to-fine alignment paradigm, imitating the human cognitive process of identifying specific items within a large collection. Our method employs a Focused-view Fusion Network with a sophisticated cross-attention mechanism, dynamically aligning and comparing features across multiple videos to capture finer nuances and contextual variations. Extensive experiments on six benchmark datasets confirm that TokenBinder substantially outperforms existing state-of-the-art methods. These results demonstrate its robustness and the effectiveness of its fine-grained alignment in bridging intra- and inter-modality information gaps in TVR tasks.