Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis
Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.
Jing Liu, Haiye Huo
Classical phase retrieval refers to the recovery of an unknown signal
from its Fourier magnitudes, and is widely used in fields such as quantum
mechanics, signal processing, and optics. The offset linear canonical
transform (OLCT) is a more general linear integral transform that
includes the Fourier transform (FT), fractional Fourier transform (FrFT), and
linear canonical transform (LCT) as special cases. Hence, in this paper, we
focus on the uniqueness problem of phase retrieval in the framework of the OLCT.
First, we prove that all the nontrivial ambiguities in continuous OLCT phase
retrieval can be represented by convolution operators, and demonstrate that a
continuous compactly supported signal can be uniquely determined up to a global
phase from its multiple magnitude-only OLCT measurements. Moreover, we
investigate the nontrivial ambiguities in the discrete OLCT phase retrieval
case. Furthermore, we demonstrate that a nonseparable function can be uniquely
recovered from the magnitudes of its short-time OLCT (STOLCT) up to a global phase.
Finally, we show that signals that are bandlimited in the FT or OLCT domain can be
reconstructed from their sampled STOLCT magnitude measurements, up to a global
phase, provided the ambiguity function of the window function satisfies some mild
conditions.
Authors' comments: 21 pages
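The "up to a global phase" qualifier can be illustrated with a minimal sketch. It uses the plain discrete Fourier transform (the FT special case, not the OLCT itself), and the signal values and the angle theta are arbitrary illustrative choices, not taken from the paper.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Return |DFT(signal)[k]| for every frequency index k (naive O(n^2) DFT)."""
    n = len(signal)
    mags = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(coeff))
    return mags

# Illustrative complex signal (assumed values).
signal = [1 + 0j, 2 - 1j, 0.5j, -1 + 0j]

# Multiply every sample by the same unit-modulus factor exp(i * theta).
theta = 0.7
rotated = [cmath.exp(1j * theta) * x for x in signal]

# The magnitude-only measurements cannot tell the two signals apart.
same = all(math.isclose(a, b, abs_tol=1e-9)
           for a, b in zip(dft_magnitudes(signal), dft_magnitudes(rotated)))
print(same)
```

Because the two signals produce identical magnitudes, no magnitude-only method can distinguish them, which is why uniqueness results are stated modulo this trivial ambiguity.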
Mengxi Xiao, Mang Ye, Ben Liu, Xiaofen Zong, He Li, Jimin Huang, Qianqian Xie, Min Peng
The application of AI in psychiatric diagnosis faces significant challenges,
including the subjective nature of mental health assessments, symptom overlap
across disorders, and privacy constraints limiting data availability. To
address these issues, we present MoodAngels, the first specialized multi-agent
framework for mood disorder diagnosis. Our approach combines granular-scale
analysis of clinical assessments with a structured verification process,
enabling more accurate interpretation of complex psychiatric data.
Complementing this framework, we introduce MoodSyn, an open-source dataset of
1,173 synthetic psychiatric cases that preserves clinical validity while
ensuring patient privacy. Experimental results demonstrate that MoodAngels
outperforms conventional methods, with our baseline agent achieving 12.3%
higher accuracy than GPT-4o on real-world cases, and our full multi-agent
system delivering further improvements. Evaluation of the MoodSyn dataset
demonstrates exceptional fidelity, accurately reproducing both the core
statistical patterns and complex relationships present in the original data
while maintaining strong utility for machine learning applications. Together,
these contributions provide both an advanced diagnostic tool and a critical
research resource for computational psychiatry, bridging important gaps in
AI-assisted mental health assessment.
Authors' comments: 40 pages, 11 figures
Xiwei Xu, Hans Weytjens, Dawen Zhang, Qinghua Lu, Ingo Weber, Liming Zhu
Recent studies show that 60% of LLM-based compound systems in enterprise environments leverage some form of retrieval-augmented generation (RAG), which enhances the relevance and accuracy of LLM (or other genAI) outputs by retrieving relevant information from external data sources. LLMOps involves the practices and techniques for managing the lifecycle and operations of LLM compound systems in production environments. It supports enhancing LLM systems through continuous operations and feedback evaluation. RAGOps extends LLMOps by incorporating a strong focus on data management to address the continuous changes in external data sources. This necessitates automated methods for evaluating and testing data operations, enhancing retrieval relevance and generation quality. In this paper, we (1) characterize the generic architecture of RAG applications based on the 4+1 model view for describing software architectures, (2) outline the lifecycle of RAG systems, which integrates the management lifecycles of both the LLM and the data, (3) define the key design considerations of RAGOps across different stages of the RAG lifecycle and quality trade-off analyses, (4) highlight the overarching research challenges around RAGOps, and (5) present two use cases of RAG applications and the corresponding RAGOps considerations.
Katherine Thai, Mohit Iyyer
How well do modern long-context language models understand literary fiction?
We explore this question via the task of literary evidence retrieval,
repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark
where the entire text of a primary source (e.g., The Great Gatsby) is provided
to an LLM alongside literary criticism with a missing quotation from that work.
This setting, in which the model must generate the missing quotation, mirrors
the human process of literary analysis by requiring models to perform both
global narrative reasoning and close textual examination. We curate a
high-quality subset of 292 examples through extensive filtering and human
verification. Our experiments show that recent reasoning models, such as Gemini
Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In
contrast, the best open-weight model achieves only 29.1% accuracy, highlighting
a wide gap in interpretive reasoning between open and closed-weight models.
Despite their speed and apparent accuracy, even the strongest models struggle
with nuanced literary signals and overgeneration, signaling open challenges for
applying LLMs to literary analysis. We release our dataset and evaluation code
to encourage future work in this direction.
Authors' comments: ACL 2025
Yuxuan Wu, Le Wang, Sanping Zhou, Mengnan Liu, Gang Hua, Haoxiang Li
Controllable layout generation aims to create plausible visual arrangements
of element bounding boxes within a graphic design according to certain optional
constraints, such as the type or position of a specific component. While recent
diffusion or flow-matching models have achieved considerable advances in
multifarious conditional generation tasks, there remains considerable room for
generating optimal arrangements under given conditions. In this work, we
propose to carry out layout generation through retrieving by conditions and
reference-guided generation. Specifically, we retrieve appropriate layout
templates according to given conditions as references. The references are then
utilized to guide the denoising or flow-based transport process. By retrieving
layouts compatible with the given conditions, we can uncover the potential
information not explicitly provided in the given condition. Such an approach
offers more effective guidance to the model during the generation process, in
contrast to previous models that feed the condition to the model and let the
model infer the unprovided layout attributes directly. Meanwhile, we design a
condition-modulated attention that selectively absorbs retrieval knowledge,
adapting to the difference between retrieved templates and given conditions.
Extensive experimental results show that our method successfully produces
high-quality layouts that meet the given conditions and outperforms existing
state-of-the-art models. Code will be released upon acceptance.
Authors' comments: 12 pages, 5 figures
Yingying Zhuang, Aman Gupta, Anurag Beniwal
Multilingual information retrieval has emerged as a powerful tool for
expanding knowledge sharing across languages. However, high-quality
knowledge bases are often scarce and available in only a few languages.
An effective embedding model that maps sentences from different
languages into the same feature vector space as the knowledge base language
therefore becomes the key ingredient for cross-language knowledge sharing,
especially for transferring knowledge available in high-resource languages to
low-resource ones.
In this paper we propose a novel strategy to fine-tune multilingual embedding
models with weighted sampling for contrastive learning, enabling multilingual
information retrieval with a monolingual knowledge base. We demonstrate that
the weighted sampling strategy yields performance gains over standard
strategies of up to 31.03% in MRR and up to 33.98% in Recall@3. Additionally, our
proposed methodology is language agnostic and applicable to both multilingual
and code-switching use cases.
Authors' comments: 6 pages, accepted at GENNEXT@SIGIR25
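For reference, the two metrics quoted above can be computed as in the following minimal sketch. The query rankings and relevance sets are hypothetical data invented for illustration; they are not results from the paper.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical runs: (ranked document ids, set of relevant ids) per query.
runs = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant hit at rank 2
    (["d2", "d9", "d4", "d5"], {"d5"}),  # first relevant hit at rank 4
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
recall3 = sum(recall_at_k(r, rel, 3) for r, rel in runs) / len(runs)
print(mrr, recall3)
```

MRR rewards placing the first relevant item early, while Recall@3 only asks whether relevant items make the top three, so the two can move independently.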
Cristian-Ioan Blaga, Paul Suganthan, Sahil Dua, Krishna Srinivasan, Enrique Alfonseca, Peter Dornbach, Tom Duerig, Imed Zitouni et al.
Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types requiring models to ground textual descriptions in the context of provided visual entities: single entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark's utility as both a training corpus and an evaluation set for mixed-modal retrieval. The quality of both datasets is further affirmed through crowd-sourced human annotations. The datasets are accessible through the GitHub page: https://github.com/google-research-datasets/wit-retrieval.
Daniel Szelogowski
Despite substantial research into the biological basis of memory, the precise
mechanisms by which experiences are encoded, stored, and retrieved in the brain
remain incompletely understood. A growing body of evidence supports the engram
theory, which posits that sparse populations of neurons undergo lasting
physical and biochemical changes to support long-term memory. Yet, a
comprehensive computational framework that integrates biological findings with
mechanistic models remains elusive. This work synthesizes insights from
cellular neuroscience and computational modeling to address key challenges in
engram research: how engram neurons are identified and manipulated; how
synaptic plasticity mechanisms contribute to stable memory traces; and how
sparsity promotes efficient, interference-resistant representations. Relevant
computational approaches -- such as sparse regularization, engram gating, and
biologically inspired architectures like Sparse Distributed Memory and spiking
neural networks -- are also examined. Together, these findings suggest that
memory efficiency, capacity, and stability emerge from the interaction of
plasticity and sparsity constraints. By integrating neurobiological and
computational perspectives, this paper provides a comprehensive theoretical
foundation for engram research and proposes a roadmap for future inquiry into
the mechanisms underlying memory, with implications for the diagnosis and
treatment of memory-related disorders.
Authors' comments: 18 pages, 7 figures, 3 tables
Chihiro Maru, Shoetsu Sato
Inspired by the success of large language models (LLMs) in natural language processing, recent research has explored the building of time series foundation models and applied them to tasks such as forecasting, classification, and anomaly detection. However, their performance varies across domains and tasks. In LLM-based approaches, test-time adaptation using example-based prompting has become common, owing to the high cost of retraining. In the context of anomaly detection, which is the focus of this study, providing normal examples from the target domain can also be effective. However, time series foundation models do not naturally acquire the ability to interpret or utilize examples or instructions, because the nature of time series data used during training does not encourage such capabilities. To address this limitation, we propose a retrieval augmented time series foundation model (RATFM), which enables pretrained time series foundation models to incorporate examples for test-time adaptation. We show that RATFM achieves a performance comparable to that of in-domain fine-tuning while avoiding domain-dependent fine-tuning. Experiments on the UCR Anomaly Archive, a multi-domain dataset including nine domains, confirm the effectiveness of the proposed approach.
Mojtaba Nayyeri, Athish A Yogi, Nadeen Fathallah, Ratan Bahadur Thapa, Hans-Michael Tautenhahn, Anton Schnurpel, Steffen Staab
Transforming relational databases into knowledge graphs with enriched
ontologies enhances semantic interoperability and unlocks advanced graph-based
learning and reasoning over data. However, previous approaches either demand
significant manual effort to derive an ontology from a database schema or
produce only a basic ontology. We present RIGOR, Retrieval-augmented Iterative
Generation of RDB Ontologies, an LLM-driven approach that turns relational
schemas into rich OWL ontologies with minimal human effort. RIGOR combines
three sources via RAG (the database schema and its documentation, a repository
of domain ontologies, and a growing core ontology) to prompt a generative LLM
to produce successive, provenance-tagged delta ontology fragments. Each
fragment is refined by a judge-LLM before being merged into the core ontology,
and the process iterates table-by-table following foreign key constraints until
coverage is complete. Applied to real-world databases, our approach outputs
ontologies that score highly on standard quality dimensions such as accuracy,
completeness, conciseness, adaptability, clarity, and consistency, while
substantially reducing manual effort.
Authors' comments: Under review
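One way to picture the table-by-table iteration is a dependency ordering over foreign keys. The sketch below is an assumption about the traversal, not the authors' implementation: it visits referenced tables before the tables that reference them, and the schema and the function name table_processing_order are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the set of tables it references
# through foreign keys.
foreign_keys = {
    "order_item": {"order", "product"},
    "order": {"customer"},
    "product": set(),
    "customer": set(),
}

def table_processing_order(fks):
    """Return the tables so that every referenced table precedes its referrers.

    TopologicalSorter treats the mapped values as predecessors, so a table is
    emitted only after all tables it points to have been emitted.
    """
    return list(TopologicalSorter(fks).static_order())

order = table_processing_order(foreign_keys)
print(order)
```

With this ordering, the fragment generated for a table can rely on ontology elements already created for the tables it depends on. A schema with cyclic foreign keys would raise graphlib.CycleError and would need separate handling.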
Chong Li, Xiangyang Xue, Jianfeng Feng, Taiping Zeng
Episodic memory enables humans to recall past experiences by associating semantic elements such as objects, locations, and time into coherent event representations. While large pretrained models have shown remarkable progress in modeling semantic memory, the mechanisms for forming associative structures that support episodic memory remain underexplored. Inspired by hippocampal CA3 dynamics and its role in associative memory, we propose the Latent Structured Hopfield Network (LSHN), a biologically inspired framework that integrates continuous Hopfield attractor dynamics into an autoencoder architecture. LSHN mimics the cortical-hippocampal pathway: a semantic encoder extracts compact latent representations, a latent Hopfield network performs associative refinement through attractor convergence, and a decoder reconstructs perceptual input. Unlike traditional Hopfield networks, our model is trained end-to-end with gradient descent, achieving scalable and robust memory retrieval. Experiments on MNIST, CIFAR-10, and a simulated episodic memory task demonstrate superior performance in recalling corrupted inputs under occlusion and noise, outperforming existing associative memory models. Our work provides a computational perspective on how semantic elements can be dynamically bound into episodic memory traces through biologically grounded attractor mechanisms. Code: https://github.com/fudan-birlab/LSHN.
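Attractor-based recall can be illustrated with a minimal sketch of a classical discrete Hopfield network using Hebbian weights. This is deliberately not LSHN, which uses continuous dynamics in a learned latent space and is trained end-to-end; the stored patterns and the corrupted probe below are illustrative assumptions.

```python
def hebbian_weights(patterns):
    """Build a symmetric weight matrix with zero diagonal from +/-1 patterns."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights

def recall(weights, state, max_steps=10):
    """Apply synchronous sign updates until the state stops changing."""
    state = list(state)
    for _ in range(max_steps):
        updated = [1 if sum(w * s for w, s in zip(row, state)) >= 0 else -1
                   for row in weights]
        if updated == state:
            break
        state = updated
    return state

# Two orthogonal stored memories (assumed for illustration).
p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
weights = hebbian_weights([p1, p2])

# Corrupt the first memory by flipping its first bit, then let the network settle.
corrupted = [-1, 1, 1, 1, -1, -1, -1, -1]
restored = recall(weights, corrupted)
print(restored == p1)
```

The corrupted probe falls inside the basin of attraction of the first stored pattern, so the update rule pulls it back to that memory. This is the same associative-refinement idea that the abstract places between the encoder and the decoder.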
Sa Zhu, Huashan Chen, Wanqian Zhang, Jinchao Zhang, Zexian Yang, Xiaoshuai Hao, Bo Li
Given a text query, partially relevant video retrieval (PRVR) aims to
retrieve untrimmed videos containing relevant moments, wherein event modeling
is crucial for partitioning the video into smaller temporal events that
partially correspond to the text. Previous methods typically segment videos
into a fixed number of equal-length clips, resulting in ambiguous event
boundaries. Additionally, they rely on mean pooling to compute event
representations, inevitably introducing undesired misalignment. To address
these issues, we propose an Uneven Event Modeling (UEM) framework for PRVR. We
first introduce the Progressive-Grouped Video Segmentation (PGVS) module, which
iteratively forms events in light of both temporal dependencies and
semantic similarity between consecutive frames, enabling clear event
boundaries. Furthermore, we propose the Context-Aware Event Refinement
(CAER) module to refine the event representation conditioned on the text's
cross-attention. This enables event representations to focus on the most
relevant frames for a given text, facilitating more precise text-video
alignment. Extensive experiments demonstrate that our method achieves
state-of-the-art performance on two PRVR benchmarks.
Authors' comments: Accepted by ICME 2025
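The contrast with fixed equal-length clips can be sketched with a simple greedy rule: start a new event whenever the similarity between consecutive frames drops below a threshold. This is a simplified stand-in for the idea of uneven events, not the PGVS module itself; the 2-D frame features and the 0.8 threshold are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def segment_by_similarity(frames, threshold=0.8):
    """Group consecutive frame indices into events of uneven length."""
    if not frames:
        return []
    events = [[0]]
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) >= threshold:
            events[-1].append(i)   # still the same event
        else:
            events.append([i])     # similarity dropped: start a new event
    return events

# Hypothetical frame features: two scene changes in six frames.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.05, 1.0], [1.0, 0.05]]
events = segment_by_similarity(frames)
print(events)
```

Here the event boundaries follow the content, so events have lengths 2, 3, and 1, whereas a fixed split into equal clips would cut across the second scene.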
Yihe Dong, Lorenzo Noci, Mikhail Khodak, Mufan Li
The Transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of algorithmic tasks -- including mathematical reasoning, memorization, and retrieval -- using only gradient-based training on next-token prediction. While the core component of a Transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard Transformers to variants in which either the multi-layer perceptron (MLP) layers or the attention projectors (queries and keys) are frozen at initialization. To further isolate the contribution of attention, we introduce MixiT -- the Mixing Transformer -- a simplified, principled model in which the attention coefficients are entirely random and fixed at initialization, eliminating any input-dependent computation or learning in attention. Surprisingly, we find that MixiT matches the performance of fully trained Transformers on various algorithmic tasks, especially those involving basic arithmetic or focusing heavily on memorization. For retrieval-based tasks, we observe that having input-dependent attention coefficients is consistently beneficial, while MixiT underperforms. We attribute this failure to its inability to form specialized circuits such as induction heads -- a specific circuit known to be crucial for learning and exploiting repeating patterns in input sequences. Even more interestingly, we find that attention with frozen key and query projectors is not only able to form induction heads, but can also perform competitively on language modeling. Our results underscore the importance of architectural heterogeneity, where distinct components contribute complementary inductive biases crucial for solving different classes of tasks.
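The notion of input-independent attention can be sketched as follows. This is a minimal illustration under stated assumptions, not the MixiT implementation: the mixing weights are drawn once from Gaussian logits and row-normalized with a softmax, and causal masking and multiple heads are omitted. The function names are hypothetical.

```python
import math
import random

def random_mixing_matrix(seq_len, seed=0):
    """Draw a fixed row-stochastic mixing matrix; it never looks at the input."""
    rng = random.Random(seed)
    rows = []
    for _ in range(seq_len):
        logits = [rng.gauss(0.0, 1.0) for _ in range(seq_len)]
        peak = max(logits)                       # subtract max for stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def mix_tokens(mixing, values):
    """Replace each token by a fixed weighted average of all value vectors."""
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in mixing]

mixing = random_mixing_matrix(seq_len=4, seed=1)
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]]
mixed = mix_tokens(mixing, values)
print(mixed)
```

In standard self-attention the weights would be recomputed from queries and keys for every input; here the same matrix is reused for all inputs, which is the property the abstract links to the failure on retrieval-style tasks.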
Yucheng Cai, Ke Li, Yi Huang, Junlan Feng, Zhijian Ou
A retriever, which retrieves relevant knowledge pieces from a knowledge base
given a context, is an important component in many natural language processing
(NLP) tasks. Retrievers have been introduced in knowledge-grounded dialog
systems to improve knowledge acquisition. In knowledge-grounded dialog systems,
when conditioning on a given context, there may be multiple relevant and
correlated knowledge pieces. However, knowledge pieces are usually assumed to
be conditionally independent in current retriever models. To address this
issue, we propose Entriever, an energy-based retriever. Entriever directly
models the candidate retrieval results as a whole instead of modeling the
knowledge pieces separately, with the relevance score defined by an energy
function. We explore various architectures of energy functions and different
training methods for Entriever, and show that Entriever substantially
outperforms the strong cross-encoder baseline in knowledge retrieval tasks.
Furthermore, we show that in semi-supervised training of knowledge-grounded
dialog systems, Entriever enables effective scoring of retrieved knowledge
pieces and significantly improves end-to-end performance of dialog systems.
Authors' comments: Accepted by ACL2025 Findings
Ankita Negi, Leon Merten Lohse, Sven Velten, Ilya Sergeev, Olaf Leupold, Sakshath Sadashivaiah, Dimitrios Bessas, Aleksandr Chumakhov et al.
Phase retrieval is at the heart of adaptive optics and modern high-resolution
imaging. Without phase information, optical systems are limited to
intensity-only measurements, hindering full reconstruction of object structures
and wavefront dynamics essential for advanced applications. Here, we address a
one-dimensional phase problem linking energy and time, which arises in X-ray
scattering from ultrasharp nuclear resonances. We leverage the Mössbauer
effect, where nuclei scatter radiation without energy loss to the lattice, and
are sensitive to their magneto-chemical environments. Rather than using
traditional spectroscopy with radioactive gamma-ray sources, we measure nuclear
forward scattering of synchrotron X-ray pulses in the time domain, providing
superior sensitivity and faster data acquisition. Extracting spectral
information from a single measurement is challenging due to the missing phase
information, typically requiring extensive modeling. Instead, we use multiple
energetically overlapping measurements to retrieve both the transmission
spectrum and the phase of the scattering response, similar to ptychographic
phase retrieval in imaging. Our robust approach can overcome the bandwidth
limitations of gamma-ray sources, opening new research directions with modern
X-ray sources and Mössbauer isotopes.
Authors' comments: 14 pages and 13 figures with supplementary, submitted for publication
Jiahui Geng, Fengyu Cai, Shaobo Cui, Qing Li, Liangwei Chen, Chenyang Lyu, Haonan Li, Derui Zhu et al.
Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across four key dimensions: correctness, efficiency, security, and maintainability. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages, and is accompanied by two quality-centric evaluation metrics: Pairwise Preference Accuracy and Margin-based Ranking Score. Using CoQuIR, we benchmark 23 retrieval models, covering both open-source and proprietary systems, and find that even top-performing models frequently fail to distinguish buggy or insecure code from their more robust counterparts. Furthermore, we conduct preliminary investigations into training methods that explicitly encourage retrievers to recognize code quality. Using synthetic datasets, we demonstrate promising improvements in quality-aware metrics across various models, without sacrificing semantic relevance. Downstream code generation experiments further validate the effectiveness of our approach. Overall, our work highlights the importance of integrating quality signals into code retrieval systems, laying the groundwork for more trustworthy and robust software development tools.
Aniketh Garikaparthi, Manasi Patwardhan, Aditya Sanjiv Kanade, Aman Hassan, Lovekesh Vig, Arman Cohan
There has been a surge of interest in harnessing the reasoning capabilities
of Large Language Models (LLMs) to accelerate scientific discovery. While
existing approaches rely on grounding the discovery process within the relevant
literature, effectiveness varies significantly with the quality and nature of
the retrieved literature. We address the challenge of retrieving prior work
whose concepts can inspire solutions for a given research problem, a task we
define as Methodology Inspiration Retrieval (MIR). We construct a novel dataset
tailored for training and evaluating retrievers on MIR, and establish
baselines. To address MIR, we build the Methodology Adjacency Graph (MAG),
capturing methodological lineage through citation relationships. We leverage
MAG to embed an "intuitive prior" into dense retrievers for identifying
patterns of methodological inspiration beyond superficial semantic similarity.
This achieves significant gains of +5.4 in Recall@3 and +7.8 in Mean Average
Precision (mAP) over strong baselines. Further, we adapt LLM-based re-ranking
strategies to MIR, yielding additional improvements of +4.5 in Recall@3 and
+4.8 in mAP. Through extensive ablation studies and qualitative analyses, we
demonstrate the promise of MIR in enhancing automated scientific discovery and
outline avenues for advancing inspiration-driven retrieval.
Authors' comments: ACL 2025
Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
Robustness and effectiveness are critical aspects of developing dense retrieval models for real-world applications. It is known that there is a trade-off between the two. Recent work has addressed scaling laws of effectiveness in dense retrieval, revealing a power-law relationship between effectiveness and the size of models and data. Does robustness follow scaling laws too? If so, can scaling improve both robustness and effectiveness together, or do they remain locked in a trade-off? To answer these questions, we conduct a comprehensive experimental study. We find that: (i) Robustness, including out-of-distribution and adversarial robustness, also follows a scaling law. (ii) Robustness and effectiveness exhibit different scaling patterns, leading to significant resource costs when jointly improving both. Given these findings, we shift to a third factor that affects model performance beyond model size and data size, namely the optimization strategy. We find that: (i) By fitting different optimization strategies, the joint performance of robustness and effectiveness traces out a Pareto frontier. (ii) When the optimization strategy strays from Pareto efficiency, the joint performance scales in a sub-optimal direction. (iii) By adjusting the optimization weights to fit the Pareto efficiency, we can achieve Pareto training, where the scaling of joint performance becomes most efficient. Even without requiring additional resources, Pareto training is comparable to the performance of scaling resources several times under optimization strategies that overly prioritize either robustness or effectiveness. Finally, we demonstrate that our findings can help deploy dense retrieval models in real-world applications that scale efficiently and are balanced for robustness and effectiveness.
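The notion of a Pareto frontier over the two objectives can be made concrete with a minimal sketch. Each point is an (effectiveness, robustness) pair for one hypothetical optimization strategy, higher being better on both axes. The numbers are invented for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Keep the points that no other point matches or beats on both objectives."""
    frontier = []
    for p in points:
        dominated = any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

# Hypothetical (effectiveness, robustness) scores, one per optimization strategy.
strategies = [
    (0.70, 0.40),
    (0.65, 0.55),
    (0.60, 0.50),  # dominated by (0.65, 0.55)
    (0.50, 0.60),
    (0.68, 0.38),  # dominated by (0.70, 0.40)
]

frontier = pareto_frontier(strategies)
print(frontier)
```

Strategies off the frontier give up one objective without gaining on the other, which matches the abstract's description of joint performance scaling in a sub-optimal direction.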
Fanhang Man, Xiaoyue Chen, Huandong Wang, Baining Zhao, Han Li, Xinlei Chen, Yong Li
Understanding what emotions images evoke in their viewers is a foundational goal in human-centric visual computing. While recent advances in vision-language models (VLMs) have shown promise for visual emotion analysis (VEA), several key challenges remain unresolved. Emotional cues in images are often abstract, overlapping, and entangled, making them difficult to model and interpret. Moreover, VLMs struggle to align these complex visual patterns with emotional semantics due to limited supervision and sparse emotional grounding. Finally, existing approaches lack structured affective knowledge to resolve ambiguity and ensure consistent emotional reasoning across diverse visual domains. To address these limitations, we propose K-EVER², a knowledge-enhanced framework for emotion reasoning and retrieval. Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment. Without relying on handcrafted labels or direct emotion supervision, K-EVER² achieves robust and interpretable emotion predictions across heterogeneous image types. We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts. K-EVER² consistently outperforms strong CNN and VLM baselines, achieving up to a 19% accuracy gain for specific emotions and a 12.3% average accuracy gain across all emotion categories. Our results demonstrate a scalable and generalizable solution for advancing emotional understanding of visual content.