Trieu An, Long Nguyen, Minh Le Nguyen
The Citation Discovery Shared Task focuses on predicting the correct citation
from a given candidate pool for a given paragraph. The main challenges stem
from the length of the abstract paragraphs and the high similarity among
candidate abstracts, making it difficult to determine the exact paper to cite.
To address this, we develop a system that first retrieves the top-k most
similar abstracts based on extracted relational features from the given
paragraph. From this subset, we leverage a Large Language Model (LLM) to
accurately identify the most relevant citation. We evaluate our framework on
the training dataset provided by the SCIDOCA 2025 organizers, demonstrating its
effectiveness in citation prediction.
Authors' comments: In the Proceedings of SCIDOCA 2025
Hoang-An Trieu, Dinh-Truong Do, Chau Nguyen, Vu Tran, Minh Le Nguyen
In recent years, with the appearance of the COVID-19 pandemic, numerous
publications relevant to this disease have been issued. Because of the massive
volume of publications, an efficient retrieval system is necessary to provide
researchers with useful information if an unexpected pandemic happens so
suddenly, like COVID-19. In this work, we present a method to help the
retrieval system, the Covrelex-SE system, to provide more high-quality search
results. We exploited the power of the large language models (LLMs) to extract
the hidden relationships inside the unlabeled publication that cannot be found
by the current parsing tools that the system is using. Since then, help the
system to have more useful information during retrieval progress.
Authors' comments: In the Proceedings of SCIDOCA 2024
Mihailo Stojnic
Companion paper [118] developed a powerful \emph{Random duality theory} (RDT) based analytical program to statistically characterize performance of \emph{descending} phase retrieval algorithms (dPR) (these include all variants of gradient descents and among them widely popular Wirtinger flows). We here generalize the program and show how it can be utilized to handle rank $d$ positive definite phase retrieval (PR) measurements (with special cases $d=1$ and $d=2$ serving as emulations of the real and complex phase retrievals, respectively). In particular, we observe that the minimal sample complexity ratio (number of measurements scaled by the dimension of the unknown signal) which ensures dPR's success exhibits a phase transition (PT) phenomenon. For both plain and lifted RDT we determine phase transitions locations. To complement theoretical results we implement a log barrier gradient descent variant and observe that, even in small dimensional scenarios (with problem sizes on the order of 100), the simulated phase transitions are in an excellent agreement with the theoretical predictions.
Mihailo Stojnic
We analyze the relation between spectral initializers and theoretical limits of \emph{descending} phase retrieval algorithms (dPR). In companion paper [104], for any sample complexity ratio, $\alpha$, \emph{parametric manifold}, ${\mathcal {PM}}(\alpha)$, is recognized as a critically important structure that generically determines dPRs abilities to solve phase retrieval (PR). Moreover, overlap between the algorithmic solution and the true signal is positioned as a key ${\mathcal {PM}}$'s component. We here consider the so-called \emph{overlap optimal} spectral initializers (OptSpins) as dPR's starting points and develop a generic \emph{Random duality theory} (RDT) based program to statistically characterize them. In particular, we determine the functional structure of OptSpins and evaluate the starting overlaps that they provide for the dPRs. Since ${\mathcal {PM}}$'s so-called \emph{flat regions} are highly susceptible to \emph{local jitteriness} and as such are key obstacles on dPR's path towards PR's global optimum, a precise characterization of the starting overlap allows to determine if such regions can be successfully circumvented. Through the presented theoretical analysis we observe two key points in that regard: \textbf{\emph{(i)}} dPR's theoretical phase transition (critical $\alpha$ above which they solve PR) might be difficult to practically achieve as the ${\mathcal {PM}}$'s flat regions are large causing the associated OptSpins to fall exactly within them; and \textbf{\emph{(ii)}} Opting for so-called ``\emph{safer compression}'' and slightly increasing $\alpha$ (by say $15\%$) shrinks flat regions and allows OptSpins to fall outside them and dPRs to ultimately solve PR. Numerical simulations are conducted as well and shown to be in an excellent agreement with theoretical predictions.
Xiangzhao Hao, Kuan Zhu, Hongyu Guo, Haiyun Guo, Ming Tang, JinQiao Wang
Natural language querying of visual content underpins many vision-language tasks, typically categorized by text granularity and visual search scope. Text-Image Retrieval (TIR) retrieves whole images using coarse descriptions, while Referring Expression Comprehension (REC) localizes objects using fine-grained expressions within a single image. However, real-world scenarios often require both instance-level retrieval and localization across large galleries -- tasks where TIR lacks precision and REC lacks scalability. To address this gap, we propose a new task: Referring Expression Instance Retrieval (REIR), which jointly supports instance-level retrieval and localization. We introduce REIRCOCO, a large-scale benchmark constructed by prompting vision-language models to generate fine-grained expressions for MSCOCO and RefCOCO instances. We also present a baseline method, CLARE, featuring a dual-stream architecture with a Mix of Relation Experts (MORE) module for capturing inter-instance relationships. CLARE integrates object detection and REC pretraining with Contrastive Language-Instance Alignment (CLIA) for end-to-end optimization. Experiments show that CLARE achieves state-of-the-art performance on REIR and generalizes well to TIR and REC, highlighting its effectiveness and versatility.
Tam Trinh, Manh Nguyen, Truong-Son Hy
The rapid spread of misinformation in the digital era poses significant challenges to public discourse, necessitating robust and scalable fact-checking solutions. Traditional human-led fact-checking methods, while credible, struggle with the volume and velocity of online content, prompting the integration of automated systems powered by Large Language Models (LLMs). However, existing automated approaches often face limitations, such as handling complex claims, ensuring source credibility, and maintaining transparency. This paper proposes a novel multi-agent system for automated fact-checking that enhances accuracy, efficiency, and explainability. The system comprises four specialized agents: an Input Ingestion Agent for claim decomposition, a Query Generation Agent for formulating targeted subqueries, an Evidence Retrieval Agent for sourcing credible evidence, and a Verdict Prediction Agent for synthesizing veracity judgments with human-interpretable explanations. Evaluated on benchmark datasets (FEVEROUS, HOVER, SciFact), the proposed system achieves a 12.3% improvement in Macro F1-score over baseline methods. The system effectively decomposes complex claims, retrieves reliable evidence from trusted sources, and generates transparent explanations for verification decisions. Our approach contributes to the growing field of automated fact-checking by providing a more accurate, efficient, and transparent verification methodology that aligns with human fact-checking practices while maintaining scalability for real-world applications. Our source code is available at https://github.com/HySonLab/FactAgent
Wenhao Li, Hongkuan Zhang, Hongwei Zhang, Zhengxu Li, Zengjie Dong, Yafan Chen, Niranjan Bidargaddi, Hong Liu
Current medical language models, adapted from large language models (LLMs), typically predict ICD code-based diagnosis from electronic health records (EHRs) because these labels are readily available. However, ICD codes do not capture the nuanced, context-rich reasoning clinicians use for diagnosis. Clinicians synthesize diverse patient data and reference clinical practice guidelines (CPGs) to make evidence-based decisions. This misalignment limits the clinical utility of existing models. We introduce GARMLE-G, a Generation-Augmented Retrieval framework that grounds medical language model outputs in authoritative CPGs. Unlike conventional Retrieval-Augmented Generation based approaches, GARMLE-G enables hallucination-free outputs by directly retrieving authoritative guideline content without relying on model-generated text. It (1) integrates LLM predictions with EHR data to create semantically rich queries, (2) retrieves relevant CPG knowledge snippets via embedding similarity, and (3) fuses guideline content with model output to generate clinically aligned recommendations. A prototype system for hypertension diagnosis was developed and evaluated on multiple metrics, demonstrating superior retrieval precision, semantic relevance, and clinical guideline adherence compared to RAG-based baselines, while maintaining a lightweight architecture suitable for localized healthcare deployment. This work provides a scalable, low-cost, and hallucination-free method for grounding medical language models in evidence-based clinical practice, with strong potential for broader clinical deployment.
Catarina Pires, Sérgio Nunes, Luís Filipe Teixeira
Evaluating Information Retrieval (IR) systems relies on high-quality manual
relevance judgments (qrels), which are costly and time-consuming to obtain.
While pooling reduces the annotation effort, it results in only partially
labeled datasets. Large Language Models (LLMs) offer a promising alternative to
reducing reliance on manual judgments, particularly in complex domains like
medical case-based retrieval, where relevance assessment requires analyzing
both textual and visual information. In this work, we explore using a
Multimodal Large Language Model (MLLM) to expand relevance judgments, creating
a new dataset of automated judgments. Specifically, we employ Gemini 1.5 Pro on
the ImageCLEFmed 2013 case-based retrieval task, simulating human assessment
through an iteratively refined, structured prompting strategy that integrates
binary scoring, instruction-based evaluation, and few-shot learning. We
systematically experimented with various prompt configurations to maximize
agreement with human judgments. To evaluate agreement between the MLLM and
human judgments, we use Cohen's Kappa, achieving a substantial agreement score
of 0.6, comparable to inter-annotator agreement typically observed in
multimodal retrieval tasks. Starting from the original 15,028 manual judgments
(4.72% relevant) across 35 topics, our MLLM-based approach expanded the dataset
by over 37x to 558,653 judgments, increasing relevant annotations to 5,950. On
average, each medical case query received 15,398 new annotations, with
approximately 99% being non-relevant, reflecting the high sparsity typical in
this domain. Our results demonstrate the potential of MLLMs to scale relevance
judgment collection, offering a promising direction for supporting retrieval
evaluation in medical and multimodal IR tasks.
Authors' comments: To appear at the Third Workshop on Large Language Models for
Evaluation in Information Retrieval (LLM4Eval 2025), co-located with SIGIR
2025. 9 pages, 2 figures, 5 tables
Bowen Wang
The real value of knowledge lies not just in its accumulation, but in its potential to be harnessed effectively to conquer the unknown. Although recent multimodal large language models (MLLMs) exhibit impressing multimodal capabilities, they often fail in rarely encountered domain-specific tasks due to limited relevant knowledge. To explore this, we adopt visual game cognition as a testbed and select Monster Hunter: World as the target to construct a multimodal knowledge graph (MH-MMKG), which incorporates multi-modalities and intricate entity relations. We also design a series of challenging queries based on MH-MMKG to evaluate the models' ability for complex knowledge retrieval and reasoning. Furthermore, we propose a multi-agent retriever that enables a model to autonomously search relevant knowledge without additional training. Experimental results show that our approach significantly enhances the performance of MLLMs, providing a new perspective on multimodal knowledge-augmented reasoning and laying a solid foundation for future research.
Yunhan Ren, Feng Luo, Siyu Huang
While existing Generalized Category Discovery (GCD) models have achieved
significant success, their performance with limited labeled samples and a small
number of known categories remains largely unexplored. In this work, we
introduce the task of Few-shot Generalized Category Discovery (FSGCD), aiming
to achieve competitive performance in GCD tasks under conditions of known
information scarcity. To tackle this challenge, we propose a decision boundary
enhancement framework with affinity-based retrieval. Our framework is designed
to learn the decision boundaries of known categories and transfer these
boundaries to unknown categories. First, we use a decision boundary
pre-training module to mitigate the overfitting of pre-trained information on
known category boundaries and improve the learning of these decision boundaries
using labeled samples. Second, we implement a two-stage retrieval-guided
decision boundary optimization strategy. Specifically, this strategy further
enhances the severely limited known boundaries by using affinity-retrieved
pseudo-labeled samples. Then, these refined boundaries are applied to unknown
clusters via guidance from affinity-based feature retrieval. Experimental
results demonstrate that our proposed method outperforms existing methods on
six public GCD benchmarks under the FSGCD setting. The codes are available at:
https://github.com/Ryh1218/FSGCD
Authors' comments: Accepted by ICMR 2025
Chenxu Wang, Yonggang Jin, Cheng Hu, Youpeng Zhao, Zipeng Dai, Jian Zhao, Shiyu Huang, Liuyu Xiang et al.
Adapting a single agent to a new multi-agent system brings challenges,
necessitating adjustments across various tasks, environments, and interactions
with unknown teammates and opponents. Addressing this challenge is highly
complex, and researchers have proposed two simplified scenarios, Multi-agent
reinforcement learning for zero-shot learning and Ad-Hoc Teamwork. Building on
these foundations, we propose a more comprehensive setting, Agent
Collaborative-Competitive Adaptation (ACCA), which evaluates an agent to
generalize across diverse scenarios, tasks, and interactions with both
unfamiliar opponents and teammates. In ACCA, agents adjust to task and
environmental changes, collaborate with unseen teammates, and compete against
unknown opponents. We introduce a new modeling approach, Multi-Retrieval and
Dynamic Generation (MRDG), that effectively models both teammates and opponents
using their behavioral trajectories. This method incorporates a positional
encoder for varying team sizes and a hypernetwork module to boost agents'
learning and adaptive capabilities. Additionally, a viewpoint alignment module
harmonizes the observational perspectives of retrieved teammates and opponents
with the learning agent. Extensive tests in benchmark scenarios like SMAC,
Overcooked-AI, and Melting Pot show that MRDG significantly improves robust
collaboration and competition with unseen teammates and opponents, surpassing
established baselines. Our code is available at:
https://github.com/vcis-wangchenxu/MRDG.git
Authors' comments: This manuscript is under submission to Neurocomputing
Kangqi Chen, Andreas Kosmas Kakolyris, Rakesh Nadig, Manos Frouzakis, Nika Mansouri Ghiasi, Yu Liang, Haiyu Mao, Jisung Park et al.
Large Language Models (LLMs) face an inherent challenge: their knowledge is
confined to the data that they have been trained on. To overcome this issue,
Retrieval-Augmented Generation (RAG) complements the static training-derived
knowledge of LLMs with an external knowledge repository. RAG consists of three
stages: indexing, retrieval, and generation. The retrieval stage of RAG becomes
a significant bottleneck in inference pipelines. In this stage, a user query is
mapped to an embedding vector and an Approximate Nearest Neighbor Search (ANNS)
algorithm searches for similar vectors in the database to identify relevant
items. Due to the large database sizes, ANNS incurs significant data movement
overheads between the host and the storage system. To alleviate these
overheads, prior works propose In-Storage Processing (ISP) techniques that
accelerate ANNS by performing computations inside storage. However, existing
works that leverage ISP for ANNS (i) employ algorithms that are not tailored to
ISP systems, (ii) do not accelerate data retrieval operations for data selected
by ANNS, and (iii) introduce significant hardware modifications, limiting
performance and hindering their adoption. We propose REIS, the first ISP system
tailored for RAG that addresses these limitations with three key mechanisms.
First, REIS employs a database layout that links database embedding vectors to
their associated documents, enabling efficient retrieval. Second, it enables
efficient ANNS by introducing an ISP-tailored data placement technique that
distributes embeddings across the planes of the storage system and employs a
lightweight Flash Translation Layer. Third, REIS leverages an ANNS engine that
uses the existing computational resources inside the storage system. Compared
to a server-grade system, REIS improves the performance (energy efficiency) of
retrieval by an average of 13x (55x).
Authors' comments: Extended version of our publication at the 52nd International
Symposium on Computer Architecture (ISCA-52), 2025
Chao He, Hongxi Wei
Deep image hashing aims to enable effective large-scale image retrieval by
mapping the input images into simple binary hash codes through deep neural
networks. More recently, Vision Mamba with linear time complexity has attracted
extensive attention from researchers by achieving outstanding performance on
various computer tasks. Nevertheless, the suitability of Mamba for large-scale
image retrieval tasks still needs to be explored. Towards this end, we propose
a visual state space hashing model, called MambaHash. Concretely, we propose a
backbone network with stage-wise architecture, in which grouped Mamba operation
is introduced to model local and global information by utilizing Mamba to
perform multi-directional scanning along different groups of the channel.
Subsequently, the proposed channel interaction attention module is used to
enhance information communication across channels. Finally, we meticulously
design an adaptive feature enhancement module to increase feature diversity and
enhance the visual representation capability of the model. We have conducted
comprehensive experiments on three widely used datasets: CIFAR-10, NUS-WIDE and
IMAGENET. The experimental results demonstrate that compared with the
state-of-the-art deep hashing methods, our proposed MambaHash has well
efficiency and superior performance to effectively accomplish large-scale image
retrieval tasks. Source code is available
https://github.com/shuaichaochao/MambaHash.git
Authors' comments: Accepted by ICMR2025. arXiv admin note: text overlap with
arXiv:2405.07524
Yuan Zhang, Chun-Kai Fan, Tao Huang, Ming Lu, Sicheng Yu, Junwen Pan, Kuan Cheng, Qi She et al.
Inspired by text prompts in large language models (LLMs), visual prompts have
been explored to enhance the reasoning capabilities of large vision-language
models (LVLMs). Current methods design heuristic visual prompts, such as
overlaying a text-query-guided attention heatmap on the original input image.
However, designing effective prompts manually is challenging and
time-consuming, and it often fails to explore the benefits of different visual
prompts, leading to sub-optimal performance. To this end, we propose
\textbf{AutoV} that learns to automatically select the optimal visual prompt
from various candidates based on given textual queries and the input image. To
train AutoV, we developed an automatic data collection and labeling pipeline
that evaluates various visual prompts with a pre-trained LVLM. We input a set
of visual prompts into the LVLM and rank them according to the prediction
losses generated by the model. Using the ranking as a supervision signal, we
train AutoV to automatically choose the optimal visual prompt from various
visual prompts for LVLMs. Experimental results indicate that AutoV enhances the
performance of various LVLMs across multiple popular image understanding tasks.
For instance, LLaVA-OV with AutoV achieves $\textbf{1.7}\%$ accuracy gain on
LLaVA$^{\text{Wild}}$, and AutoV boosts Qwen2.5-VL by $\textbf{1.9}\%$ on MMMU,
highlighting its potential as an optimal visual prompting method for LVLMs.
Authors' comments: 19 pages
Xinyue Huang, Ziqi Lin, Fang Sun, Wenchao Zhang, Kejian Tong, Yunbo Liu
This paper presents a novel Retrieval-Augmented Generation (RAG) framework tailored for complex question answering tasks, addressing challenges in multi-hop reasoning and contextual understanding across lengthy documents. Built upon LLaMA 3, the framework integrates a dense retrieval module with advanced context fusion and multi-hop reasoning mechanisms, enabling more accurate and coherent response generation. A joint optimization strategy combining retrieval likelihood and generation cross-entropy improves the model's robustness and adaptability. Experimental results show that the proposed system outperforms existing retrieval-augmented and generative baselines, confirming its effectiveness in delivering precise, contextually grounded answers.
Duong Bach
Multi-vector document retrieval systems, such as ColPali, excel in fine-grained matching for complex queries but incur significant storage and computational costs due to their reliance on high-dimensional patch embeddings and late-interaction scoring. To address these challenges, we propose HPC-ColPali, a Hierarchical Patch Compression framework that enhances the efficiency of ColPali while preserving its retrieval accuracy. Our approach integrates three innovative techniques: (1) K-Means quantization, which compresses patch embeddings into 1-byte centroid indices, achieving up to 32$\times$ storage reduction; (2) attention-guided dynamic pruning, utilizing Vision-Language Model attention weights to retain only the top-$p\%$ most salient patches, reducing late-interaction computation by up to 60\% with less than 2\% nDCG@10 loss; and (3) optional binary encoding of centroid indices into $b$-bit strings ($b=\lceil\log_2 K\rceil$), enabling rapid Hamming distance-based similarity search for resource-constrained environments. Evaluated on the ViDoRe and SEC-Filings datasets, HPC-ColPali achieves 30--50\% lower query latency under HNSW indexing while maintaining high retrieval precision. When integrated into a Retrieval-Augmented Generation pipeline for legal summarization, it reduces hallucination rates by 30\% and halves end-to-end latency. These advancements establish HPC-ColPali as a scalable and efficient solution for multi-vector document retrieval across diverse applications. Code is available at https://github.com/DngBack/HPC-ColPali.
Authors' comments: 9 pages
Jushaan Singh Kalra, Xinran Zhao, To Eun Kim, Fengyu Cai, Fernando Diaz, Tongshuang Wu
Retrieval-augmented Generation (RAG) is powerful, but its effectiveness
hinges on which retrievers we use and how. Different retrievers offer distinct,
often complementary signals: BM25 captures lexical matches; dense retrievers,
semantic similarity. Yet in practice, we typically fix a single retriever based
on heuristics, which fails to generalize across diverse information needs. Can
we dynamically select and integrate multiple retrievers for each individual
query, without the need for manual selection? In our work, we validate this
intuition with quantitative analysis and introduce mixture of retrievers: a
zero-shot, weighted combination of heterogeneous retrievers. Extensive
experiments show that such mixtures are effective and efficient: Despite
totaling just 0.8B parameters, this mixture outperforms every individual
retriever and even larger 7B models by +10.8% and +3.9% on average,
respectively. Further analysis also shows that this mixture framework can help
incorporate specialized non-oracle human information sources as retrievers to
achieve good collaboration, with a 58.9% relative performance improvement over
simulated humans alone.
Authors' comments: 19 pages, 3 figures
Yu-Ting Lan, Yang Huo, Yi Shen, Xiao Yang, Zuotao Liu
The item cold-start problem is critical for online recommendation systems, as the success of this phase determines whether high-quality new items can transition to popular ones, receive essential feedback to inspire creators, and thus lead to the long-term retention of creators. However, modern recommendation systems still struggle to address item cold-start challenges due to the heavy reliance on item and historical interactions, which are non-trivial for cold-start items lacking sufficient exposure and feedback. Lookalike algorithms provide a promising solution by extending feedback for new items based on lookalike users. Traditional lookalike algorithms face such limitations: (1) failing to effectively model the lookalike users and further improve recommendations with the existing rule- or model-based methods; and (2) struggling to utilize the interaction signals and incorporate diverse features in modern recommendation systems. Inspired by lookalike algorithms, we propose Next-User Retrieval, a novel framework for enhancing cold-start recommendations via generative next-user modeling. Specifically, we employ a transformer-based model to capture the unidirectional relationships among recently interacted users and utilize these sequences to generate the next potential user who is most likely to interact with the item. The additional item features are also integrated as prefix prompt embeddings to assist the next-user generation. The effectiveness of Next-User Retrieval is evaluated through both offline experiments and online A/B tests. Our method achieves significant improvements with increases of 0.0142% in daily active users and +0.1144% in publications in Douyin, showcasing its practical applicability and scalability.
Yang Fan, Zhang Qi, Xing Wenqian, Liu Chang, Liu Liu
This article addresses domain knowledge gaps in general large language models for historical text analysis in the context of computational humanities and AIGC technology. We propose the Graph RAG framework, combining chain-of-thought prompting, self-instruction generation, and process supervision to create a The First Four Histories character relationship dataset with minimal manual annotation. This dataset supports automated historical knowledge extraction, reducing labor costs. In the graph-augmented generation phase, we introduce a collaborative mechanism between knowledge graphs and retrieval-augmented generation, improving the alignment of general models with historical knowledge. Experiments show that the domain-specific model Xunzi-Qwen1.5-14B, with Simplified Chinese input and chain-of-thought prompting, achieves optimal performance in relation extraction (F1 = 0.68). The DeepSeek model integrated with GraphRAG improves F1 by 11% (0.08-0.19) on the open-domain C-CLUE relation extraction dataset, surpassing the F1 value of Xunzi-Qwen1.5-14B (0.12), effectively alleviating hallucinations phenomenon, and improving interpretability. This framework offers a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research.
Isabella Mastroianni, Marco Guerra, Ulderico Fugacci, Emanuela De Negri
Motivated by the need to relate the biparameter persistence module induced by a pair of scalar functions with the monoparameter persistence modules induced by each function separately, we introduce a construction that defines a kind of product between two monoparameter persistence modules. While originally conceived to serve this comparative purpose, our construction unexpectedly reveals a deeper structural property: it also characterizes a class of biparameter modules known as hook-decomposable modules.