Nikhil N S, Amol Dilip Joshi, Bilal Muhammed, Soban Babu
Selecting a solution algorithm for the Facility Layout Problem (FLP), an
NP-hard optimization problem with a multiobjective trade-off, is a complex task
that requires deep expert knowledge. The performance of a given algorithm
depends on specific problem characteristics such as its scale, objectives, and
constraints. This creates a need for a data-driven recommendation method to
guide algorithm selection in automated design systems. This paper introduces a
new recommendation method to make such expertise accessible, based on a
Knowledge Graph-based Retrieval-Augmented Generation (KG RAG) framework. To
address this, a domain-specific knowledge graph is constructed from published
literature. The method then employs a multi-faceted retrieval mechanism to
gather relevant evidence from this knowledge graph using three distinct
approaches, which include a precise graph-based search, flexible vector-based
search, and high-level cluster-based search. The retrieved evidence is utilized
by a Large Language Model (LLM) to generate algorithm recommendations with
data-driven reasoning. The proposed KG-RAG method is compared against a
commercial LLM chatbot with access to the knowledge base as a table, across a
series of diverse, real-world FLP test cases. Based on recommendation accuracy
and reasoning capability, the proposed method performed significantly better
than the commercial LLM chatbot.
Authors' comments: 10 pages, 5 figures
Jüri Keller, Maik Fröbe, Gijs Hendriksen, Daria Alexander, Martin Potthast, Philipp Schaer
The longitudinal evaluation of retrieval systems aims to capture how
information needs and documents evolve over time. However, classical
Cranfield-style retrieval evaluations only consist of a static set of queries
and documents and thereby miss time as an evaluation dimension. Therefore,
longitudinal evaluations need to complement retrieval toolkits with custom
logic. This custom logic increases the complexity of research software, which
might reduce the reproducibility and extensibility of experiments. Based on our
submissions to the 2024 edition of LongEval, we propose a custom extension of
ir_datasets for longitudinal retrieval experiments. This extension allows for
declaratively, instead of imperatively, describing important aspects of
longitudinal retrieval experiments, e.g., which queries, documents, and/or
relevance feedback are available at which point in time. We reimplement our
submissions to LongEval 2024 against our new ir_datasets extension, and find
that the declarative access can reduce the complexity of the code.
Authors' comments: Best of labs paper for LongEval at CLEF 2024
Kunrong Li, Kwan Hui Lim
Next point-of-interest (POI) recommendation predicts a user's next
destination from historical movements. Traditional models require intensive
training, while LLMs offer flexible and generalizable zero-shot solutions but
often generate generic or geographically irrelevant results due to missing
trajectory and spatial context. To address these issues, we propose RALLM-POI,
a framework that couples LLMs with retrieval-augmented generation and
self-rectification. We first propose a Historical Trajectory Retriever (HTR)
that retrieves relevant past trajectories to serve as contextual references,
which are then reranked by a Geographical Distance Reranker (GDR) for
prioritizing spatially relevant trajectories. Lastly, an Agentic LLM Rectifier
(ALR) is designed to refine outputs through self-reflection. Without additional
training, RALLM-POI achieves substantial accuracy gains across three real-world
Foursquare datasets, outperforming both conventional and LLM-based baselines.
Code is released at https://github.com/LKRcrocodile/RALLM-POI.
Authors' comments: PRICAI 2025
Peng Wang, Yong Li, Lin Zhao, Xiu-Shen Wei
Fine-grained hashing has become a powerful solution for rapid and efficient image retrieval, particularly in scenarios requiring high discrimination between visually similar categories. To enable each hash bit to correspond to specific visual attributes, we propoe a novel method that harnesses learnable queries for attribute-aware hash codes learning. This method deploys a tailored set of queries to capture and represent nuanced attribute-level information within the hashing process, thereby enhancing both the interpretability and relevance of each hash bit. Building on this query-based optimization framework, we incorporate an auxiliary branch to help alleviate the challenges of complex landscape optimization often encountered with low-bit hash codes. This auxiliary branch models high-order attribute interactions, reinforcing the robustness and specificity of the generated hash codes. Experimental results on benchmark datasets demonstrate that our method generates attribute-aware hash codes and consistently outperforms state-of-the-art techniques in retrieval accuracy and robustness, especially for low-bit hash codes, underscoring its potential in fine-grained image hashing tasks.
Eason Chen, Chuangji Li, Shizhuo Li, Conrad Borchers, Zimo Xiao, Chloe Qianhui Zhao, Jionghao Lin, Kenneth R. Koedinger
Technology-enhanced learning environments often help students retrieve relevant learning content for questions arising during self-paced study. Large language models (LLMs) have emerged as novel aids for information retrieval during learning. While LLMs are effective for general-purpose question-answering, they typically lack alignment with the domain knowledge of specific course materials such as textbooks and slides. We investigate Retrieval-Augmented Generation (RAG) and GraphRAG, a knowledge graph-enhanced RAG approach, for page-level question answering in an undergraduate mathematics textbook. While RAG has been effective for retrieving discrete, contextually relevant passages, GraphRAG may excel in modeling interconnected concepts and hierarchical knowledge structures. We curate a dataset of 477 question-answer pairs, each tied to a distinct textbook page. We then compare the standard embedding-based RAG methods to GraphRAG for evaluating both retrieval accuracy-whether the correct page is retrieved-and generated answer quality via F1 scores. Our findings show that embedding-based RAG achieves higher retrieval accuracy and better F1 scores compared to GraphRAG, which tends to retrieve excessive and sometimes irrelevant content due to its entity-based structure. We also explored re-ranking the retrieved pages with LLM and observed mixed results, including performance drop and hallucinations when dealing with larger context windows. Overall, this study highlights both the promises and challenges of page-level retrieval systems in educational contexts, emphasizing the need for more refined retrieval methods to build reliable AI tutoring solutions in providing reference page numbers.
Zengli Luo, Canlong Zhang, Xiaochun Lu, Zhixin Li
Text-based Pedestrian Retrieval (TPR) aims to retrieve specific target
pedestrians in visual scenes according to natural language descriptions.
Although existing methods have achieved progress under constrained settings,
interactive retrieval in the open-world scenario still suffers from limited
model generalization and insufficient semantic understanding. To address these
challenges, we propose FitPro, an open-world interactive zero-shot TPR
framework with enhanced semantic comprehension and cross-scene adaptability.
FitPro has three innovative components: Feature Contrastive Decoding (FCD),
Incremental Semantic Mining (ISM), and Query-aware Hierarchical Retrieval
(QHR). The FCD integrates prompt-guided contrastive decoding to generate
high-quality structured pedestrian descriptions from denoised images,
effectively alleviating semantic drift in zero-shot scenarios. The ISM
constructs holistic pedestrian representations from multi-view observations to
achieve global semantic modeling in multi-turn interactions,thereby improving
robustness against viewpoint shifts and fine-grained variations in
descriptions. The QHR dynamically optimizes the retrieval pipeline according to
query types, enabling efficient adaptation to multi-modal and multi-view
inputs. Extensive experiments on five public datasets and two evaluation
protocols demonstrate that FitPro significantly overcomes the generalization
limitations and semantic modeling constraints of existing methods in
interactive retrieval, paving the way for practical deployment. The code and
data will be released at https://github.com/
lilo4096/FitPro-Interactive-Person-Retrieval.
Authors' comments: 15pages,6 figures
Ji Soo Lee, Byungoh Ko, Jaewon Cho, Howoong Lee, Jaewoon Byun, Hyunwoo J. Kim
In text-video retrieval, auxiliary captions are often used to enhance video
understanding, bridging the gap between the modalities. While recent advances
in multi-modal large language models (MLLMs) have enabled strong zero-shot
caption generation, we observe that such captions tend to be generic and
indistinguishable across visually similar videos, limiting their utility for
fine-grained retrieval. Moreover, conventional captioning approaches are
typically evaluated using language generation metrics, such as BLEU, which are
not typically tailored for retrieval tasks that require making discriminative
distinctions between candidates. To address this, we propose
$\textbf{CaRe-DPO}$, a retrieval framework that directly optimizes caption
generation using retrieval relevance scores. At its core is Dual-Group Direct
Preference Optimization (DG-DPO), a novel learning strategy that supervises
captioning by modeling preferences across groups of distinct video and caption
pairs. In addition, we present an MLLM-based retrieval model that incorporates
role-embeddings to better distinguish between textual inputs with different
functional roles, such as an auxiliary caption and a text query. Through
extensive experiments, we demonstrate that CaRe-DPO significantly enhances
retrieval performance by effectively leveraging auxiliary knowledge to generate
fine-grained captions for retrieval. Code is available at
https://github.com/mlvlab/CaReDPO.
Authors' comments: EMNLP 2025 Findings
Cheng Jiayang, Qianqian Zhuang, Haoran Li, Chunkit Chan, Xin Liu, Lin Qiu, Yangqiu Song
Grounding large language models (LLMs) in external knowledge sources is a
promising method for faithful prediction. While existing grounding approaches
work well for simple queries, many real-world information needs require
synthesizing multiple pieces of evidence. We introduce "integrative grounding"
-- the challenge of retrieving and verifying multiple inter-dependent pieces of
evidence to support a hypothesis query. To systematically study this problem,
we repurpose data from four domains for evaluating integrative grounding
capabilities. Our investigation reveals two critical findings: First, in
groundedness verification, while LLMs are robust to redundant evidence, they
tend to rationalize using internal knowledge when information is incomplete.
Second, in examining retrieval planning strategies, we find that undirected
planning can degrade performance through noise introduction, while premise
abduction emerges as a promising approach due to its logical constraints.
Additionally, LLMs' zero-shot self-reflection capabilities consistently improve
grounding quality. These insights provide valuable direction for developing
more effective integrative grounding systems.
Authors' comments: Accepted to EMNLP 2025 Findings
Tomoaki Isoda
Retrieval-Augmented Generation (RAG) has significantly improved the performance of large language models (LLMs) on knowledge-intensive tasks in recent years. However, since retrieval systems may return irrelevant content, incorporating such information into the model often leads to hallucinations. Thus, identifying and filtering out unhelpful retrieved content is a key challenge for improving RAG performance.To better integrate the internal knowledge of the model with external knowledge from retrieval, it is essential to understand what the model "knows" and "does not know" (which is also called "self-knowledge"). Based on this insight, we propose SKILL-RAG (Self-Knowledge Induced Learning and Filtering for RAG), a novel method that leverages the model's self-knowledge to determine which retrieved documents are beneficial for answering a given query. We design a reinforcement learning-based training framework to explicitly elicit self-knowledge from the model and employs sentence-level granularity to filter out irrelevant content while preserving useful knowledge.We evaluate SKILL-RAG using Llama2-7B and Qwen3-8B on several question answering benchmarks. Experimental results demonstrate that SKILL-RAG not only improves generation quality but also significantly reduces the number of input documents, validating the importance of self-knowledge in guiding the selection of high-quality retrievals.
Sheng Zhang, Yifan Ding, Shuquan Lian, Shun Song, Hui Li
Repository-level code completion automatically predicts the unfinished code
based on the broader information from the repository. Recent strides in Code
Large Language Models (code LLMs) have spurred the development of
repository-level code completion methods, yielding promising results.
Nevertheless, they suffer from issues such as inappropriate query construction,
single-path code retrieval, and misalignment between code retriever and code
LLM. To address these problems, we introduce CodeRAG, a framework tailored to
identify relevant and necessary knowledge for retrieval-augmented
repository-level code completion. Its core components include log probability
guided query construction, multi-path code retrieval, and preference-aligned
BestFit reranking. Extensive experiments on benchmarks ReccEval and CCEval
demonstrate that CodeRAG significantly and consistently outperforms
state-of-the-art methods. The implementation of CodeRAG is available at
https://github.com/KDEGroup/CodeRAG.
Authors' comments: EMNLP 2025
Xinchen Wan, Jinhua Liang, Huan Zhang
Existing digital mental wellness tools often overlook the nuanced emotional
states underlying everyday challenges. For example, pre-sleep anxiety affects
more than 1.5 billion people worldwide, yet current approaches remain largely
static and "one-size-fits-all", failing to adapt to individual needs. In this
work, we present EmoHeal, an end-to-end system that delivers personalized,
three-stage supportive narratives. EmoHeal detects 27 fine-grained emotions
from user text with a fine-tuned XLM-RoBERTa model, mapping them to musical
parameters via a knowledge graph grounded in music therapy principles (GEMS,
iso-principle). EmoHeal retrieves audiovisual content using the CLAMP3 model to
guide users from their current state toward a calmer one
("match-guide-target"). A within-subjects study (N=40) demonstrated significant
supportive effects, with participants reporting substantial mood improvement
(M=4.12, p<0.001) and high perceived emotion recognition accuracy (M=4.05,
p<0.001). A strong correlation between perceived accuracy and therapeutic
outcome (r=0.72, p<0.001) validates our fine-grained approach. These findings
establish the viability of theory-driven, emotion-aware digital wellness tools
and provides a scalable AI blueprint for operationalizing music therapy
principles.
Authors' comments: 5 pages, 5 figures. Submitted to the 2026 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)
Jisu Kim, Jinhee Park, Changhyun Jeon, Jungwoo Choi, Keonwoo Kim, Minji Hong, Sehyun Kim
Traditional query expansion techniques for addressing vocabulary mismatch problems in information retrieval are context-sensitive and may lead to performance degradation. As an alternative, document expansion research has gained attention, but existing methods such as Doc2Query have limitations including excessive preprocessing costs, increased index size, and reliability issues with generated content. To mitigate these problems and seek more structured and efficient alternatives, this study proposes a method that divides documents into chunk units and generates textual data for each chunk to simultaneously improve retrieval efficiency and accuracy. The proposed "Chunk Knowledge Generation Model" adopts a T5-based multi-task learning structure that simultaneously generates titles and candidate questions from each document chunk while extracting keywords from user queries. This approach maximizes computational efficiency by generating and extracting three types of semantic information in parallel through a single encoding and two decoding processes. The generated data is utilized as additional information in the retrieval system. GPT-based evaluation on 305 query-document pairs showed that retrieval using the proposed model achieved 95.41% accuracy at Top@10, demonstrating superior performance compared to document chunk-level retrieval. This study contributes by proposing an approach that simultaneously generates titles and candidate questions from document chunks for application in retrieval pipelines, and provides empirical evidence applicable to large-scale information retrieval systems by demonstrating improved retrieval accuracy through qualitative evaluation.
Hideo Hirose
Checkered patterns are characterized by their square structure and the use of
only two distinct colors. These colors are typically represented by two types
of numerical sets: {1,0} and {1,-1}. Matrices based on {1,0} may seem identical
to those based on {1,-1} when forming checkered patterns because the only
difference is that the numbers 0 are changed to -1. However, these two kinds of
matrices are completely different in a mathematical sense because a matrix
using {1,0} has a rank of 2 and a matrix using {1,-1} has a rank of 1. Knowing
this difference in advance allows us to reduce the computational effort
required for matrix operations such as information embedding and retrieving.
Authors' comments: 8 pages, 3 figures
Shuo Xing, Peiran Li, Yuping Wang, Ruizheng Bai, Yueqi Wang, Chan-Wei Hu, Chengxuan Qian, Huaxiu Yao et al.
The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent advancements have focused on applying direct preference optimization (DPO) on carefully curated datasets to mitigate these issues. Yet, such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset, effectively incorporating both textual and visual preference signals. We further introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning. Our experimental results demonstrate that Re-Align not only mitigates hallucinations more effectively than previous methods but also yields significant performance gains in general visual question-answering (VQA) tasks. Moreover, we show that Re-Align maintains robustness and scalability across a wide range of VLM sizes and architectures. This work represents a significant step forward in aligning multimodal LLMs, paving the way for more reliable and effective cross-modal applications. We release all the code in https://github.com/taco-group/Re-Align.
Authors' comments: Published at EMNLP 2025
Mohamed Eltahir, Osamah Sarraj, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammed Khurd, Mohammed Bremoo, Tanveer Hussain
Video-to-text and text-to-video retrieval are dominated by English benchmarks (e.g. DiDeMo, MSR-VTT) and recent multilingual corpora (e.g. RUDDER), yet Arabic remains underserved, lacking localized evaluation metrics. We introduce a three-stage framework, AutoArabic, utilizing state-of-the-art large language models (LLMs) to translate non-Arabic benchmarks into Modern Standard Arabic, reducing the manual revision required by nearly fourfold. The framework incorporates an error detection module that automatically flags potential translation errors with 97% accuracy. Applying the framework to DiDeMo, a video retrieval benchmark produces DiDeMo-AR, an Arabic variant with 40,144 fluent Arabic descriptions. An analysis of the translation errors is provided and organized into an insightful taxonomy to guide future Arabic localization efforts. We train a CLIP-style baseline with identical hyperparameters on the Arabic and English variants of the benchmark, finding a moderate performance gap (about 3 percentage points at Recall@1), indicating that Arabic localization preserves benchmark difficulty. We evaluate three post-editing budgets (zero/ flagged-only/ full) and find that performance improves monotonically with more post-editing, while the raw LLM output (zero-budget) remains usable. To ensure reproducibility to other languages, we made the code available at https://github.com/Tahaalshatiri/AutoArabic.
Authors' comments: Accepted at ArabicNLP 2025 (EMNLP 2025 workshop)
Pranjal A. Chitale, Bishal Santra, Yashoteja Prabhu, Amit Sharma
Compact dual-encoder models are widely used for retrieval owing to their efficiency and scalability. However, such models often underperform compared to their Large Language Model (LLM)-based retrieval counterparts, likely due to their limited world knowledge. While LLM-based data augmentation has been proposed as a strategy to bridge this performance gap, there is insufficient understanding of its effectiveness and scalability to real-world retrieval problems. Existing research does not systematically explore key factors such as the optimal augmentation scale, the necessity of using large augmentation models, and whether diverse augmentations improve generalization, particularly in out-of-distribution (OOD) settings. This work presents a comprehensive study of the effectiveness of LLM augmentation for retrieval, comprising over 100 distinct experimental settings of retrieval models, augmentation models and augmentation strategies. We find that, while augmentation enhances retrieval performance, its benefits diminish beyond a certain augmentation scale, even with diverse augmentation strategies. Surprisingly, we observe that augmentation with smaller LLMs can achieve performance competitive with larger augmentation models. Moreover, we examine how augmentation effectiveness varies with retrieval model pre-training, revealing that augmentation provides the most benefit to models which are not well pre-trained. Our insights pave the way for more judicious and efficient augmentation strategies, thus enabling informed decisions and maximizing retrieval performance while being more cost-effective. Code and augmented datasets accompanying this work are publicly available at https://aka.ms/DAGR.
Authors' comments: EMNLP 2025 (MAIN Conference)
Thong Nguyen, Yibin Lei, Jia-Huei Ju, Andrew Yates
Visual Document Retrieval (VDR) typically operates as text-to-image retrieval
using specialized bi-encoders trained to directly embed document images. We
revisit a zero-shot generate-and-encode pipeline: a vision-language model first
produces a detailed textual description of each document image, which is then
embedded by a standard text encoder. On the ViDoRe-v2 benchmark, the method
reaches 63.4% nDCG@5, surpassing the strongest specialised multi-vector visual
document encoder. It also scales better to large collections and offers broader
multilingual coverage. Analysis shows that modern vision-language models
capture complex textual and visual cues with sufficient granularity to act as a
reusable semantic proxy. By offloading modality alignment to pretrained
vision-language models, our approach removes the need for computationally
intensive text-image contrastive training and establishes a strong zero-shot
baseline for future VDR systems.
Authors' comments: Accepted
Jonas Geiger, Marta Moscati, Shah Nawaz, Markus Schedl
Music is characterized by aspects related to different modalities, such as
the audio signal, the lyrics, or the music video clips. This has motivated the
development of multimodal datasets and methods for Music Information Retrieval
(MIR) tasks such as genre classification or autotagging. Music can be described
at different levels of granularity, for instance defining genres at the level
of artists or music albums. However, most datasets for multimodal MIR neglect
this aspect and provide data at the level of individual music tracks. We aim to
fill this gap by providing Music4All Artist and Album (Music4All A+A), a
dataset for multimodal MIR tasks based on music artists and albums. Music4All
A+A is built on top of the Music4All-Onion dataset, an existing track-level
dataset for MIR tasks. Music4All A+A provides metadata, genre labels, image
representations, and textual descriptors for 6,741 artists and 19,511 albums.
Furthermore, since Music4All A+A is built on top of Music4All-Onion, it allows
access to other multimodal data at the track level, including user--item
interaction data. This renders Music4All A+A suitable for a broad range of MIR
tasks, including multimodal music recommendation, at several levels of
granularity. To showcase the use of Music4All A+A, we carry out experiments on
multimodal genre classification of artists and albums, including an analysis in
missing-modality scenarios, and a quantitative comparison with genre
classification in the movie domain. Our experiments show that images are more
informative for classifying the genres of artists and albums, and that several
multimodal models for genre classification struggle in generalizing across
domains. We provide the code to reproduce our experiments at
https://github.com/hcai-mms/Music4All-A-A, the dataset is linked in the
repository and provided open-source under a CC BY-NC-SA 4.0 license.
Authors' comments: 7 pages, 6 tables, IEEE International Conference on Content-Based
Multimedia Indexing (IEEE CBMI)
Yihao Guo, Haocheng Bian, Liutong Zhou, Ze Wang, Zhaoyi Zhang, Francois Kawala, Milan Dean, Ian Fischer et al.
With the deployment of Large Language Models (LLMs) in interactive applications, online malicious intent detection has become increasingly critical. However, existing approaches fall short of handling diverse and complex user queries in real time. To address these challenges, we introduce ADRAG (Adversarial Distilled Retrieval-Augmented Guard), a two-stage framework for robust and efficient online malicious intent detection. In the training stage, a high-capacity teacher model is trained on adversarially perturbed, retrieval-augmented inputs to learn robust decision boundaries over diverse and complex user queries. In the inference stage, a distillation scheduler transfers the teacher's knowledge into a compact student model, with a continually updated knowledge base collected online. At deployment, the compact student model leverages top-K similar safety exemplars retrieved from the online-updated knowledge base to enable both online and real-time malicious query detection. Evaluations across ten safety benchmarks demonstrate that ADRAG, with a 149M-parameter model, achieves 98.5% of WildGuard-7B's performance, surpasses GPT-4 by 3.3% and Llama-Guard-3-8B by 9.5% on out-of-distribution detection, while simultaneously delivering up to 5.6x lower latency at 300 queries per second (QPS) in real-time applications.
Shiyuan Luo, Runlong Yu, Chonghao Qiu, Rahul Ghosh, Robert Ladwig, Paul C. Hanson, Yiqun Xie, Xiaowei Jia
The discovery of environmental knowledge depends on labeled task-specific data, but is often constrained by the high cost of data collection. Existing machine learning approaches usually struggle to generalize in data-sparse or atypical conditions. To this end, we propose an Augmentation-Adaptive Self-Supervised Learning (A$^2$SL) framework, which retrieves relevant observational samples to enhance modeling of the target ecosystem. Specifically, we introduce a multi-level pairwise learning loss to train a scenario encoder that captures varying degrees of similarity among scenarios. These learned similarities drive a retrieval mechanism that supplements a target scenario with relevant data from different locations or time periods. Furthermore, to better handle variable scenarios, particularly under atypical or extreme conditions where traditional models struggle, we design an augmentation-adaptive mechanism that selectively enhances these scenarios through targeted data augmentation. Using freshwater ecosystems as a case study, we evaluate A$^2$SL in modeling water temperature and dissolved oxygen dynamics in real-world lakes. Experimental results show that A$^2$SL significantly improves predictive accuracy and enhances robustness in data-scarce and atypical scenarios. Although this study focuses on freshwater ecosystems, the A$^2$SL framework offers a broadly applicable solution in various scientific domains.