Sa Zhu, Huashan Chen, Wanqian Zhang, Jinchao Zhang, Zexian Yang, Xiaoshuai Hao, Bo Li
Given a text query, partially relevant video retrieval (PRVR) aims to
retrieve untrimmed videos containing relevant moments, wherein event modeling
is crucial for partitioning the video into smaller temporal events that
partially correspond to the text. Previous methods typically segment videos
into a fixed number of equal-length clips, resulting in ambiguous event
boundaries. Additionally, they rely on mean pooling to compute event
representations, inevitably introducing undesired misalignment. To address
these issues, we propose an Uneven Event Modeling (UEM) framework for PRVR. We first
introduce the Progressive-Grouped Video Segmentation (PGVS) module, which
iteratively forms events in light of both temporal dependencies and
semantic similarity between consecutive frames, enabling clear event
boundaries. We further propose the Context-Aware Event Refinement
(CAER) module to refine the event representation conditioned on the text's
cross-attention. This enables event representations to focus on the most
relevant frames for a given text, facilitating more precise text-video
alignment. Extensive experiments demonstrate that our method achieves
state-of-the-art performance on two PRVR benchmarks.
Authors' comments: Accepted by ICME 2025
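The UEM abstract above describes grouping consecutive frames into events by semantic similarity rather than cutting equal-length clips. A minimal sketch of that general idea follows. It is not the authors' PGVS module: the greedy rule, the `threshold` value, and the function names `cosine` and `segment_frames` are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment_frames(frames, threshold=0.8):
    """Greedily group consecutive frame features into variable-length events.

    A new event starts whenever a frame's similarity to the mean feature of
    the current event drops below `threshold` (a hypothetical hyperparameter).
    Returns a list of (start, end) index pairs with `end` exclusive.
    """
    if not frames:
        return []
    events = []
    start = 0
    running_sum = list(frames[0])
    for i in range(1, len(frames)):
        count = i - start  # number of frames already in the current event
        mean = [s / count for s in running_sum]
        if cosine(frames[i], mean) < threshold:
            events.append((start, i))
            start = i
            running_sum = list(frames[i])
        else:
            running_sum = [s + x for s, x in zip(running_sum, frames[i])]
    events.append((start, len(frames)))
    return events


# Toy 2-D "frame features": two frames of one scene, then three of another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
print(segment_frames(frames))  # [(0, 2), (2, 5)]
```

The resulting events have unequal lengths (2 and 3 frames), which is the property that fixed equal-length clipping cannot provide.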
Yihe Dong, Lorenzo Noci, Mikhail Khodak, Mufan Li
The Transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of algorithmic tasks -- including mathematical reasoning, memorization, and retrieval -- using only gradient-based training on next-token prediction. While the core component of a Transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard Transformers to variants in which either the multi-layer perceptron (MLP) layers or the attention projectors (queries and keys) are frozen at initialization. To further isolate the contribution of attention, we introduce MixiT -- the Mixing Transformer -- a simplified, principled model in which the attention coefficients are entirely random and fixed at initialization, eliminating any input-dependent computation or learning in attention. Surprisingly, we find that MixiT matches the performance of fully trained Transformers on various algorithmic tasks, especially those involving basic arithmetic or focusing heavily on memorization. For retrieval-based tasks, we observe that having input-dependent attention coefficients is consistently beneficial, while MixiT underperforms. We attribute this failure to its inability to form specialized circuits such as induction heads -- a specific circuit known to be crucial for learning and exploiting repeating patterns in input sequences. Even more interestingly, we find that attention with frozen key and query projectors is not only able to form induction heads, but can also perform competitively on language modeling. Our results underscore the importance of architectural heterogeneity, where distinct components contribute complementary inductive biases crucial for solving different classes of tasks.
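To make concrete what "attention coefficients that are entirely random and fixed at initialization" can mean, here is a minimal sketch of an input-independent mixing layer. It is an illustration under stated assumptions, not the MixiT construction itself: the Gaussian scores, the causal mask, the softmax normalisation, and the names `random_mixing_matrix` and `mix` are all hypothetical choices.

```python
import math
import random


def random_mixing_matrix(seq_len, seed=0):
    """Sample a fixed causal mixing matrix that never looks at the input.

    Row i is a softmax over random Gaussian scores for positions 0..i, so each
    row sums to 1. The causal mask and the softmax are assumptions.
    """
    rng = random.Random(seed)
    matrix = []
    for i in range(seq_len):
        scores = [rng.gauss(0.0, 1.0) for _ in range(i + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (seq_len - i - 1)
        matrix.append(row)
    return matrix


def mix(matrix, values):
    """Apply the fixed mixing weights to a sequence of value vectors."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in matrix
    ]


weights = random_mixing_matrix(4, seed=0)  # sampled once, then frozen
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mixed = mix(weights, values)
```

Because `weights` is computed without reference to `values`, the same coefficients are reused for every input sequence. Standard self-attention differs exactly here: it recomputes the weights from queries and keys for each input.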
Yucheng Cai, Ke Li, Yi Huang, Junlan Feng, Zhijian Ou
A retriever, which retrieves relevant knowledge pieces from a knowledge base
given a context, is an important component in many natural language processing
(NLP) tasks. Retrievers have been introduced in knowledge-grounded dialog
systems to improve knowledge acquisition. In knowledge-grounded dialog systems,
when conditioning on a given context, there may be multiple relevant and
correlated knowledge pieces. However, knowledge pieces are usually assumed to
be conditionally independent in current retriever models. To address this
issue, we propose Entriever, an energy-based retriever. Entriever directly
models the candidate retrieval results as a whole instead of modeling the
knowledge pieces separately, with the relevance score defined by an energy
function. We explore various architectures of energy functions and different
training methods for Entriever, and show that Entriever substantially
outperforms the strong cross-encoder baseline in knowledge retrieval tasks.
Furthermore, we show that in semi-supervised training of knowledge-grounded
dialog systems, Entriever enables effective scoring of retrieved knowledge
pieces and significantly improves end-to-end performance of dialog systems.
Authors' comments: Accepted by ACL2025 Findings
Ankita Negi, Leon Merten Lohse, Sven Velten, Ilya Sergeev, Olaf Leupold, Sakshath Sadashivaiah, Dimitrios Bessas, Aleksandr Chumakhov et al.
Phase retrieval is at the heart of adaptive optics and modern high-resolution
imaging. Without phase information, optical systems are limited to
intensity-only measurements, hindering full reconstruction of object structures
and wavefront dynamics essential for advanced applications. Here, we address a
one-dimensional phase problem linking energy and time, which arises in X-ray
scattering from ultrasharp nuclear resonances. We leverage the Mössbauer
effect, where nuclei scatter radiation without energy loss to the lattice, and
are sensitive to their magneto-chemical environments. Rather than using
traditional spectroscopy with radioactive gamma-ray sources, we measure nuclear
forward scattering of synchrotron X-ray pulses in the time domain, providing
superior sensitivity and faster data acquisition. Extracting spectral
information from a single measurement is challenging due to the missing phase
information, typically requiring extensive modeling. Instead, we use multiple
energetically overlapping measurements to retrieve both the transmission
spectrum and the phase of the scattering response, similar to ptychographic
phase retrieval in imaging. Our robust approach can overcome the bandwidth
limitations of gamma-ray sources, opening new research directions with modern
X-ray sources and Mössbauer isotopes.
Authors' comments: 14 pages and 13 figures with supplementary, submitted for publication
Jiahui Geng, Fengyu Cai, Shaobo Cui, Qing Li, Liangwei Chen, Chenyang Lyu, Haonan Li, Derui Zhu et al.
Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across four key dimensions: correctness, efficiency, security, and maintainability. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages, and is accompanied by two quality-centric evaluation metrics: Pairwise Preference Accuracy and Margin-based Ranking Score. Using CoQuIR, we benchmark 23 retrieval models, covering both open-source and proprietary systems, and find that even top-performing models frequently fail to distinguish buggy or insecure code from its more robust counterparts. Furthermore, we conduct preliminary investigations into training methods that explicitly encourage retrievers to recognize code quality. Using synthetic datasets, we demonstrate promising improvements in quality-aware metrics across various models, without sacrificing semantic relevance. Downstream code generation experiments further validate the effectiveness of our approach. Overall, our work highlights the importance of integrating quality signals into code retrieval systems, laying the groundwork for more trustworthy and robust software development tools.
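The CoQuIR abstract above names Pairwise Preference Accuracy without defining it. One plausible reading, sketched below purely as an assumption, is the fraction of (higher-quality, lower-quality) snippet pairs in which the retriever scores the higher-quality snippet strictly above the other. The function name, the input format, and the handling of ties are hypothetical, and the scores are made-up toy values.

```python
def pairwise_preference_accuracy(score_pairs):
    """Fraction of pairs where the preferred snippet gets the higher score.

    `score_pairs` holds (preferred_score, rejected_score) tuples, one per
    query. Ties count as failures, which is an assumption of this sketch.
    """
    if not score_pairs:
        return 0.0
    wins = sum(1 for preferred, rejected in score_pairs if preferred > rejected)
    return wins / len(score_pairs)


# Toy retriever scores for (correct snippet, buggy snippet) pairs.
pairs = [(0.9, 0.4), (0.3, 0.6), (0.7, 0.7), (0.8, 0.1)]
print(pairwise_preference_accuracy(pairs))  # 0.5
```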
Aniketh Garikaparthi, Manasi Patwardhan, Aditya Sanjiv Kanade, Aman Hassan, Lovekesh Vig, Arman Cohan
There has been a surge of interest in harnessing the reasoning capabilities
of Large Language Models (LLMs) to accelerate scientific discovery. While
existing approaches rely on grounding the discovery process within the relevant
literature, effectiveness varies significantly with the quality and nature of
the retrieved literature. We address the challenge of retrieving prior work
whose concepts can inspire solutions for a given research problem, a task we
define as Methodology Inspiration Retrieval (MIR). We construct a novel dataset
tailored for training and evaluating retrievers on MIR, and establish
baselines. To address MIR, we build the Methodology Adjacency Graph (MAG),
capturing methodological lineage through citation relationships. We leverage
MAG to embed an "intuitive prior" into dense retrievers for identifying
patterns of methodological inspiration beyond superficial semantic similarity.
This achieves significant gains of +5.4 in Recall@3 and +7.8 in Mean Average
Precision (mAP) over strong baselines. Further, we adapt LLM-based re-ranking
strategies to MIR, yielding additional improvements of +4.5 in Recall@3 and
+4.8 in mAP. Through extensive ablation studies and qualitative analyses, we
demonstrate the promise of MIR in enhancing automated scientific discovery and
outline avenues for advancing inspiration-driven retrieval.
Authors' comments: ACL 2025
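The MIR abstract above reports gains in Recall@3 and Mean Average Precision. For readers unfamiliar with these metrics, the sketch below computes their standard definitions for one query. The paper identifiers are made up, and each ranked list is assumed to contain no duplicates.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at the rank of each relevant item."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Average of per-query average precision over (ranking, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranked = ["p3", "p1", "p7", "p2"]  # hypothetical retrieved paper ids
relevant = {"p1", "p2"}            # hypothetical ground-truth inspirations
print(recall_at_k(ranked, relevant, k=3))   # 0.5
print(average_precision(ranked, relevant))  # 0.5
```

Both functions return values in [0, 1]. Papers often report them multiplied by 100.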
Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
Robustness and effectiveness are critical aspects of developing dense retrieval models for real-world applications. It is known that there is a trade-off between the two. Recent work has addressed scaling laws of effectiveness in dense retrieval, revealing a power-law relationship between effectiveness and the size of models and data. Does robustness follow scaling laws too? If so, can scaling improve both robustness and effectiveness together, or do they remain locked in a trade-off? To answer these questions, we conduct a comprehensive experimental study. We find that: (i) Robustness, including out-of-distribution and adversarial robustness, also follows a scaling law. (ii) Robustness and effectiveness exhibit different scaling patterns, leading to significant resource costs when jointly improving both. Given these findings, we shift to the third factor that affects model performance, namely the optimization strategy, beyond the model size and data size. We find that: (i) By fitting different optimization strategies, the joint performance of robustness and effectiveness traces out a Pareto frontier. (ii) When the optimization strategy strays from Pareto efficiency, the joint performance scales in a sub-optimal direction. (iii) By adjusting the optimization weights to fit the Pareto efficiency, we can achieve Pareto training, where the scaling of joint performance becomes most efficient. Even without requiring additional resources, Pareto training achieves performance comparable to that of scaling resources several times over under optimization strategies that overly prioritize either robustness or effectiveness. Finally, we demonstrate that our findings can help deploy dense retrieval models in real-world applications that scale efficiently and are balanced for robustness and effectiveness.
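The notion of a Pareto frontier over robustness and effectiveness, used in the abstract above, can be illustrated with a short sketch. Each point stands for one hypothetical optimization strategy, and a point is kept only if no other point is at least as good on both axes and strictly better on one. The numbers are invented for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated (robustness, effectiveness) points.

    Higher is better on both axes. A point is dominated when another point is
    at least as good on both axes and strictly better on at least one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)


# Made-up (robustness, effectiveness) scores for five training configurations.
configs = [(0.60, 0.40), (0.55, 0.45), (0.50, 0.42), (0.65, 0.30), (0.58, 0.38)]
print(pareto_frontier(configs))  # [(0.55, 0.45), (0.6, 0.4), (0.65, 0.3)]
```

Configurations off the frontier, such as `(0.50, 0.42)`, can be improved on both axes at once, which matches the abstract's description of strategies that stray from Pareto efficiency.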
Fanhang Man, Xiaoyue Chen, Huandong Wang, Baining Zhao, Han Li, Xinlei Chen, Yong Li
Understanding what emotions images evoke in their viewers is a foundational goal in human-centric visual computing. While recent advances in vision-language models (VLMs) have shown promise for visual emotion analysis (VEA), several key challenges remain unresolved. Emotional cues in images are often abstract, overlapping, and entangled, making them difficult to model and interpret. Moreover, VLMs struggle to align these complex visual patterns with emotional semantics due to limited supervision and sparse emotional grounding. Finally, existing approaches lack structured affective knowledge to resolve ambiguity and ensure consistent emotional reasoning across diverse visual domains. To address these limitations, we propose K-EVER², a knowledge-enhanced framework for emotion reasoning and retrieval. Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment. Without relying on handcrafted labels or direct emotion supervision, K-EVER² achieves robust and interpretable emotion predictions across heterogeneous image types. We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts. K-EVER² consistently outperforms strong CNN and VLM baselines, achieving up to a 19% accuracy gain for specific emotions and a 12.3% average accuracy gain across all emotion categories. Our results demonstrate a scalable and generalizable solution for advancing emotional understanding of visual content.
Guangyuan Liu, Yinqiu Liu, Ruichen Zhang, Hongyang Du, Dusit Niyato, Zehui Xiong, Sumei Sun, Abbas Jamalipour
The rapid development of multimodal AI and Large Language Models (LLMs) has greatly enhanced real-time interaction, decision-making, and collaborative tasks. However, in wireless multi-agent scenarios, limited bandwidth poses significant challenges to exchanging semantically rich multimodal information efficiently. Traditional semantic communication methods, though effective, struggle with redundancy and loss of crucial details. To overcome these challenges, we propose a Retrieval-Augmented Multimodal Semantic Communication (RAMSemCom) framework. RAMSemCom incorporates iterative, retrieval-driven semantic refinement tailored for distributed multi-agent environments, enabling efficient exchange of critical multimodal elements through local caching and selective transmission. Our approach dynamically optimizes retrieval using deep reinforcement learning (DRL) to balance semantic fidelity with bandwidth constraints. A comprehensive case study on multi-agent autonomous driving demonstrates that our DRL-based retrieval strategy significantly improves task completion efficiency and reduces communication overhead compared to baseline methods.
Thushara Manjari Naduvilakandy, Hyeju Jang, Mohammad Al Hasan
Causality detection and mining are important tasks in information retrieval
due to their extensive use in information extraction and knowledge graph
construction. The existing literature offers several solutions to these tasks,
both unsupervised and supervised. However, unsupervised
methods suffer from poor performance and often require significant human
intervention for causal rule selection, leading to poor generalization across
different domains. Supervised methods, on the other hand, suffer from the lack
of large training datasets. Recently, large language models (LLMs) with
effective prompt engineering have been found to overcome the
unavailability of large training datasets. Yet the existing literature
lacks comprehensive work on causality detection and mining using LLM
prompting. In this paper, we present several retrieval-augmented generation
(RAG) based dynamic prompting schemes to enhance LLM performance in causality
detection and extraction tasks. Extensive experiments over three datasets and
five LLMs validate the superiority of our proposed RAG-based dynamic prompting
over other static prompting schemes.
Authors' comments: 13 pages, 6 figures, published in knowledgeNLP-NAACL2025
Adriano Fragomeni, Dima Damen, Michael Wray
Text-to-Video (T2V) retrieval aims to identify the most relevant item from a gallery of videos based on a user's text query. Traditional methods rely solely on aligning video and text modalities to compute the similarity and retrieve relevant items. However, recent advancements emphasise incorporating auxiliary information extracted from video and text modalities to improve retrieval performance and bridge the semantic gap between these modalities. Auxiliary information can include visual attributes, such as objects; temporal and spatial context; and textual descriptions, such as speech and rephrased captions. This survey comprehensively reviews 81 research papers on Text-to-Video retrieval that utilise such auxiliary information. It provides a detailed analysis of their methodologies; highlights state-of-the-art results on benchmark datasets; and discusses available datasets and their auxiliary information. Additionally, it proposes promising directions for future research, focusing on different ways to further enhance retrieval performance using this information.
Michael Grohs, Adrian Rebmann, Jana-Rebecca Rehse
Conformance checking techniques detect undesired process behavior by
comparing process executions that are recorded in event logs to desired
behavior that is captured in a dedicated process model. If such models are not
available, conformance checking techniques are not applicable, but
organizations might still be interested in detecting undesired behavior in
their processes. To enable this, existing approaches use Large Language Models
(LLMs), assuming that they can learn to distinguish desired from undesired
behavior through fine-tuning. However, fine-tuning is highly resource-intensive
and the fine-tuned LLMs often do not generalize well. To address these
limitations, we propose an approach that requires neither a dedicated process
model nor resource-intensive fine-tuning to detect undesired process behavior.
Instead, we use Retrieval Augmented Generation (RAG) to provide an LLM with
direct access to a knowledge base that contains both desired and undesired
process behavior from other processes, assuming that the LLM can transfer this
knowledge to the process at hand. Our evaluation shows that our approach
outperforms fine-tuned LLMs in detecting undesired behavior, demonstrating that
RAG is a viable alternative to resource-intensive fine-tuning, particularly
when enriched with relevant context from the event log, such as frequent traces
and activities.
Authors' comments: Accepted at the BPM Forum, located at the International Conference on
Business Process Management (BPM) 2025
Shuyang Cao, Karthik Radhakrishnan, David Rosenberg, Steven Lu, Pengxiang Cheng, Lu Wang, Shiyue Zhang
Retrieval-augmented generation (RAG) generally enhances large language
models' (LLMs) ability to solve knowledge-intensive tasks. But RAG may also
lead to performance degradation due to imperfect retrieval and the model's
limited ability to leverage retrieved content. In this work, we evaluate the
robustness of LLMs in practical RAG setups (henceforth retrieval robustness).
We focus on three research questions: (1) whether RAG is always better than
non-RAG; (2) whether more retrieved documents always lead to better
performance; and (3) whether document order impacts results. To facilitate this
study, we establish a benchmark of 1500 open-domain questions, each with
retrieved documents from Wikipedia. We introduce three robustness metrics, each
corresponding to one research question. Our comprehensive experiments, involving
11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit
surprisingly high retrieval robustness; nonetheless, different degrees of
imperfect robustness hinder them from fully utilizing the benefits of RAG.
Authors' comments: 19 pages
Chaeeun Kim, Jinu Lee, Wonseok Hwang
Legal Case Retrieval (LCR), which retrieves relevant cases for a given query case,
is a fundamental task for legal professionals in research and decision-making.
However, existing studies on LCR face two major limitations. First, they are
evaluated on relatively small-scale retrieval corpora (e.g., 100-55K cases) and
use a narrow range of criminal query types, which cannot sufficiently reflect
the complexity of real-world legal retrieval scenarios. Second, their reliance
on embedding-based or lexical matching methods often results in limited
representations and legally irrelevant matches. To address these issues, we
present: (1) LEGAR BENCH, the first large-scale Korean LCR benchmark, covering
411 diverse crime types in queries over 1.2M legal cases; and (2)
LegalSearchLM, a retrieval model that performs legal element reasoning over the
query case and directly generates content grounded in the target cases through
constrained decoding. Experimental results show that LegalSearchLM outperforms
baselines by 6-20% on LEGAR BENCH, achieving state-of-the-art performance. It
also demonstrates strong generalization to out-of-domain cases, outperforming
naive generative models trained on in-domain data by 15%.
Authors' comments: Under review
Yunyi Zhang, Ruozhen Yang, Siqi Jiao, SeongKu Kang, Jiawei Han
Scientific paper retrieval is essential for supporting literature discovery and research. While dense retrieval methods demonstrate effectiveness in general-purpose tasks, they often fail to capture fine-grained scientific concepts that are essential for accurate understanding of scientific queries. Recent studies also use large language models (LLMs) for query understanding; however, these methods often lack grounding in corpus-specific knowledge and may generate unreliable or unfaithful content. To overcome these limitations, we propose SemRank, an effective and efficient paper retrieval framework that combines LLM-guided query understanding with a concept-based semantic index. Each paper is indexed using multi-granular scientific concepts, including general research topics and detailed key phrases. At query time, an LLM identifies core concepts derived from the corpus to explicitly capture the query's information need. These identified concepts enable precise semantic matching, significantly enhancing retrieval accuracy. Experiments show that SemRank consistently improves the performance of various base retrievers, surpasses strong existing LLM-based baselines, and remains highly efficient.
Qifeng Wu, Zhengzhe Liu, Han Zhu, Yizhou Zhao, Daisuke Kihara, Min Xu
This paper aims to retrieve proteins with similar structures and semantics
from a large-scale protein dataset, facilitating the functional interpretation of
protein structures derived by structural determination methods like
cryo-Electron Microscopy (cryo-EM). Motivated by the recent progress of
vision-language models (VLMs), we propose a CLIP-style framework for aligning
3D protein structures with functional annotations using contrastive learning.
For model training, we propose a large-scale dataset of approximately 200,000
protein-caption pairs with rich functional descriptors. We evaluate our model
in both in-domain and more challenging cross-database retrieval on the Protein Data
Bank (PDB) and Electron Microscopy Data Bank (EMDB) datasets, respectively. In
both cases, our approach demonstrates promising zero-shot retrieval
performance, highlighting the potential of multimodal foundation models for
structure-function understanding in protein biology.
Authors' comments: 4 pages for body, 3 pages for appendix, 11 figures. Accepted to CVPR
2025 Workshop on Multimodal Foundation Models for Biomedicine: Challenges and
Opportunities (MMFM-BIOMED)
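The "CLIP-style" contrastive alignment mentioned in the protein retrieval abstract above usually refers to a symmetric cross-entropy loss over a batch similarity matrix. The sketch below shows that generic objective on a toy matrix. It is not the authors' training code: the function names and the `temperature` value of 0.07 are assumptions.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over an N x N similarity matrix.

    sim[i][j] is the similarity between structure i and caption j, with the
    matched pairs on the diagonal. `temperature` is a hypothetical setting.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    # Structure-to-caption direction: each row should pick its own column.
    s2c = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Caption-to-structure direction: each column should pick its own row.
    c2s = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (s2c + c2s)


aligned = [[1.0, 0.0], [0.0, 1.0]]     # matched pairs score highest
mismatched = [[0.0, 1.0], [1.0, 0.0]]  # matched pairs score lowest
print(symmetric_contrastive_loss(aligned) < symmetric_contrastive_loss(mismatched))  # True
```

Minimising this loss pulls matched structure-caption pairs together and pushes mismatched pairs apart in the shared embedding space. Retrieval with nearest-neighbour search relies on that geometry.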
Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, Haifeng Wang
Long-form question answering (LFQA) presents unique challenges for large language models, requiring the synthesis of coherent, paragraph-length answers. While retrieval-augmented generation (RAG) systems have emerged as a promising solution, existing research struggles with key limitations: the scarcity of high-quality training data for long-form generation, the compounding risk of hallucination in extended outputs, and the absence of reliable evaluation metrics for factual completeness. In this paper, we propose RioRAG, a novel reinforcement learning (RL) framework that advances long-form RAG through reinforced informativeness optimization. Our approach introduces two fundamental innovations to address the core challenges. First, we develop an RL training paradigm of reinforced informativeness optimization that directly optimizes informativeness and effectively addresses the slow-thinking deficit in conventional RAG systems, bypassing the need for expensive supervised data. Second, we propose a nugget-centric hierarchical reward modeling approach that enables precise assessment of long-form answers through a three-stage process: extracting the nugget from every source webpage, constructing a nugget claim checklist, and computing rewards based on factual alignment. Extensive experiments on two LFQA benchmarks, LongFact and RAGChecker, demonstrate the effectiveness of the proposed method. Our code is available at https://github.com/RUCAIBox/RioRAG.
Huiyao Chen, Yi Yang, Yinghui Li, Meishan Zhang, Min Zhang
Long document understanding has become increasingly crucial in natural
language processing, with retrieval-based methods emerging as a promising
solution to address the context length limitations of large language models
(LLMs). However, existing approaches either treat documents as flat sequences
or employ arbitrary chunking strategies, failing to capture the inherent
discourse structure that guides human comprehension. We present DISRetrieval, a
novel hierarchical retrieval framework that leverages linguistic discourse
structure to enhance long document understanding. Our approach introduces three
key innovations: (1) a discourse-aware document organization framework that
utilizes rhetorical structure theory (RST) to create sentence-level
hierarchical representations, preserving both semantic relationships and
natural document flow; (2) an LLM-enhanced node representation technique that
combines discourse structure with adaptive summarization to enrich tree nodes
with contextual information; and (3) a hierarchical evidence retrieval
mechanism that effectively selects relevant content while maintaining discourse
coherence. Through comprehensive experiments on QASPER and QuALITY datasets,
DISRetrieval demonstrates substantial improvements over existing methods in
both token-level retrieval metrics and downstream question answering tasks. Our
ablation studies confirm that incorporating discourse structure significantly
enhances retrieval effectiveness across different document lengths and query
types, validating the importance of linguistically-informed document
representation in long-text understanding. Our code and datasets are publicly
available at github/DreamH1gh/DISRetrieval to facilitate future research.
Authors' comments: 21 pages, 7 figures
Yanzhen Shen, Sihao Chen, Xueqiang Xu, Yunyi Zhang, Chaitanya Malaviya, Dan Roth
While significant progress has been made with dual- and bi-encoder dense retrievers, they often struggle on queries with logical connectives, a use case that is often overlooked yet important in downstream applications. The results they retrieve for such queries frequently fail to respect the logical constraints implied in the queries. To address this challenge, we introduce LogiCoL, a logically-informed contrastive learning objective for dense retrievers. LogiCoL builds upon in-batch supervised contrastive learning and trains dense retrievers to respect the subset and mutually-exclusive set relations between query results via two sets of soft constraints expressed via t-norms in the learning objective. We evaluate the effectiveness of LogiCoL on the task of entity retrieval, where the model is expected to retrieve a set of entities in Wikipedia that satisfy the implicit logical constraints in the query. We show that models trained with LogiCoL yield improvements in both retrieval performance and logical consistency of the results. We provide detailed analysis and insights to uncover why queries with logical connectives are challenging for dense retrievers and why LogiCoL is most effective.
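The abstract above mentions soft constraints for subset and mutually exclusive relations expressed via a t-norm, without giving formulas. The sketch below shows one way such penalties could look, using the Łukasiewicz t-norm and its residuum. The choice of t-norm, the function names, and the relevance probabilities are assumptions, and this is not the LogiCoL objective itself.

```python
def lukasiewicz_t_norm(a, b):
    """Lukasiewicz t-norm: a fuzzy AND for truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def subset_penalty(p_narrow, p_broad):
    """Soft penalty for violating results(narrow) being a subset of results(broad).

    For each document, "relevant to narrow implies relevant to broad" has the
    Lukasiewicz truth value min(1, 1 - a + b), so its violation is max(0, a - b).
    """
    return sum(max(0.0, a - b) for a, b in zip(p_narrow, p_broad)) / len(p_narrow)


def exclusion_penalty(p_a, p_b):
    """Soft penalty for a document being relevant to two exclusive queries."""
    return sum(lukasiewicz_t_norm(a, b) for a, b in zip(p_a, p_b)) / len(p_a)


# Hypothetical per-document relevance probabilities for two queries.
p_narrow = [0.2, 0.9]  # e.g. a query of the form "A and B"
p_broad = [0.5, 0.4]   # e.g. the broader query "A"
print(subset_penalty(p_narrow, p_broad))  # 0.25
```

Both penalties are zero when the predicted relevance scores respect the corresponding set relation, so they can be added to a contrastive loss as regularisers.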
Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan et al.
In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess model performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by these observations, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and is both accurate and efficient in training-free scenarios. Our code and data are available at https://github.com/IDEA-XL/RAPM.