Karen Fonseca, Leon Suarez-Rodriguez, Andres Jerez, Felipe Gutierrez-Barragan, Henry Arguello
Phase retrieval (PR) reconstructs phase information from magnitude measurements, known as coded diffraction patterns (CDPs), whose quality depends on the number of snapshots captured using coded phase masks. High-quality phase estimation requires multiple snapshots, which is not desired for efficient PR systems. End-to-end frameworks enable joint optimization of the optical system and the recovery neural network. However, their application is constrained by physical implementation limitations. Additionally, the framework is prone to gradient vanishing issues related to its global optimization process. This paper introduces a Knowledge Distillation (KD) optimization approach to address these limitations. KD transfers knowledge from a larger, lower-constrained network (teacher) to a smaller, more efficient, and implementable network (student). In this method, the teacher, a PR system trained with multiple snapshots, distills its knowledge into a single-snapshot PR system, the student. The loss functions compare the CPMs and the feature space of the recovery network. Simulations demonstrate that this approach improves reconstruction performance compared to a PR system trained without the teacher's guidance.
Authors' comments: Accepted on the IEEE International Conference on Image Processing, IEEE ICIP 2025
Yayu Long, Kewei Chen, Long Jin, Mingsheng Shang
We introduce Dynamic Retrieval-Augmented Expert Networks (DRAE), a
groundbreaking architecture that addresses the challenges of lifelong learning,
catastrophic forgetting, and task adaptation by combining the dynamic routing
capabilities of Mixture-of-Experts (MoE); leveraging the knowledge-enhancement
power of Retrieval-Augmented Generation (RAG); incorporating a novel
hierarchical reinforcement learning (RL) framework; and coordinating through
ReflexNet-SchemaPlanner-HyperOptima (RSHO).DRAE dynamically routes expert
models via a sparse MoE gating mechanism, enabling efficient resource
allocation while leveraging external knowledge through parametric retrieval
(P-RAG) to augment the learning process. We propose a new RL framework with
ReflexNet for low-level task execution, SchemaPlanner for symbolic reasoning,
and HyperOptima for long-term context modeling, ensuring continuous adaptation
and memory retention. Experimental results show that DRAE significantly
outperforms baseline approaches in long-term task retention and knowledge
reuse, achieving an average task success rate of 82.5% across a set of dynamic
robotic manipulation tasks, compared to 74.2% for traditional MoE models.
Furthermore, DRAE maintains an extremely low forgetting rate, outperforming
state-of-the-art methods in catastrophic forgetting mitigation. These results
demonstrate the effectiveness of our approach in enabling flexible, scalable,
and efficient lifelong learning for robotics.
Authors' comments: Accepted to the main conference of the Annual Meeting of the
Association for Computational Linguistics (ACL 2025)
Costas Mavromatis, Soji Adeshina, Vassilis N. Ioannidis, Zhen Han, Qi Zhu, Ian Robinson, Bryan Thompson, Huzefa Rangwala et al.
Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval; an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom ("bring-your-own") KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval, before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5% points while showing better generalization to custom KGs. BYOKG-RAG framework is open-sourced at https://github.com/awslabs/graphrag-toolkit.
Xinyi Wu, Yanhao Jia, Luwei Xiao, Shuai Zhao, Fengkuang Chiang, Erik Cambria
In AI-facilitated teaching, leveraging various query styles to interpret abstract educational content is crucial for delivering effective and accessible learning experiences. However, existing retrieval systems predominantly focus on natural text-image matching and lack the capacity to address the diversity and ambiguity inherent in real-world educational scenarios. To address this limitation, we develop a lightweight and efficient multi-modal retrieval module, named Uni-Retrieval, which extracts query-style prototypes and dynamically matches them with tokens from a continually updated Prompt Bank. This Prompt Bank encodes and stores domain-specific knowledge by leveraging a Mixture-of-Expert Low-Rank Adaptation (MoE-LoRA) module and can be adapted to enhance Uni-Retrieval's capability to accommodate unseen query types at test time. To enable natural language educational content generation, we integrate the original Uni-Retrieval with a compact instruction-tuned language model, forming a complete retrieval-augmented generation pipeline named Uni-RAG. Given a style-conditioned query, Uni-RAG first retrieves relevant educational materials and then generates human-readable explanations, feedback, or instructional content aligned with the learning objective. Experimental results on SER and other multi-modal benchmarks show that Uni-RAG outperforms baseline retrieval and RAG systems in both retrieval accuracy and generation quality, while maintaining low computational cost. Our framework provides a scalable, pedagogically grounded solution for intelligent educational systems, bridging retrieval and generation to support personalized, explainable, and efficient learning assistance across diverse STEM scenarios.
Sarat Ahmad, Zeinab Nezami, Maryam Hafeez, Syed Ali Raza Zaidi
Generative AI (GenAI) is expected to play a pivotal role in enabling autonomous optimization in future wireless networks. Within the ORAN architecture, Large Language Models (LLMs) can be specialized to generate xApps and rApps by leveraging specifications and API definitions from the RAN Intelligent Controller (RIC) platform. However, fine-tuning base LLMs for telecom-specific tasks remains expensive and resource-intensive. Retrieval-Augmented Generation (RAG) offers a practical alternative through in-context learning, enabling domain adaptation without full retraining. While traditional RAG systems rely on vector-based retrieval, emerging variants such as GraphRAG and Hybrid GraphRAG incorporate knowledge graphs or dual retrieval strategies to support multi-hop reasoning and improve factual grounding. Despite their promise, these methods lack systematic, metric-driven evaluations, particularly in high-stakes domains such as ORAN. In this study, we conduct a comparative evaluation of Vector RAG, GraphRAG, and Hybrid GraphRAG using ORAN specifications. We assess performance across varying question complexities using established generation metrics: faithfulness, answer relevance, context relevance, and factual correctness. Results show that both GraphRAG and Hybrid GraphRAG outperform traditional RAG. Hybrid GraphRAG improves factual correctness by 8%, while GraphRAG improves context relevance by 7%.
Devendra Patel, Aaditya Jain, Jayant Verma, Divyansh Rajput, Sunil Mahala, Ketki Suresh Khapare, Jayateja Kalla
We present NDAI-NeuroMAP, the first neuroscience-domain-specific dense vector
embedding model engineered for high-precision information retrieval tasks. Our
methodology encompasses the curation of an extensive domain-specific training
corpus comprising 500,000 carefully constructed triplets
(query-positive-negative configurations), augmented with 250,000
neuroscience-specific definitional entries and 250,000 structured
knowledge-graph triplets derived from authoritative neurological ontologies. We
employ a sophisticated fine-tuning approach utilizing the
FremyCompany/BioLORD-2023 foundation model, implementing a multi-objective
optimization framework combining contrastive learning with triplet-based metric
learning paradigms. Comprehensive evaluation on a held-out test dataset
comprising approximately 24,000 neuroscience-specific queries demonstrates
substantial performance improvements over state-of-the-art general-purpose and
biomedical embedding models. These empirical findings underscore the critical
importance of domain-specific embedding architectures for neuroscience-oriented
RAG systems and related clinical natural language processing applications.
Authors' comments: The document consists of 15 pages in total: the first 13 pages
comprise the main paper, while the last two pages contain supplementary
material
Sanjit Krishna
I propose a novel framework for quantum image storage using
continuous-variable (CV) photonic systems. Unlike traditional qubit-based
approaches, this model encodes grayscale image intensities into qumodes via
coherent-state displacement operators. A delta evolution mechanism enables
memory efficient storage by recording only intensity shifts between frames. To
support scalable retrieval, I introduce entropy based frame indexing using von
Neumann entropy. The proposed system is simulated using Strawberry Fields,
demonstrating partial fidelity preservation and coherent phase-space behavior
via Wigner function visualization. This approach offers a promising pathway
toward scalable, photonic-compatible quantum memory models for quantum vision
and imaging applications.
Authors' comments: 17 pages,6 figures, Early stage idea; feedback welcome
Congmin Min, Rhea Mathew, Joyce Pan, Sahil Bansal, Abbas Keshavarzi, Amar Viswanathan Kannan
We propose a scalable and cost-efficient framework for deploying Graph-based Retrieval Augmented Generation (GraphRAG) in enterprise environments. While GraphRAG has shown promise for multi-hop reasoning and structured retrieval, its adoption has been limited by the high computational cost of constructing knowledge graphs using large language models (LLMs) and the latency of graph-based retrieval. To address these challenges, we introduce two core innovations: (1) a dependency-based knowledge graph construction pipeline that leverages industrial-grade NLP libraries to extract entities and relations from unstructured text completely eliminating reliance on LLMs; and (2) a lightweight graph retrieval strategy that combines hybrid query node identification with efficient one-hop traversal for high-recall, low-latency subgraph extraction. We evaluate our framework on two SAP datasets focused on legacy code migration and demonstrate strong empirical performance. Our system achieves up to 15% and 4.35% improvements over traditional RAG baselines based on LLM-as-Judge and RAGAS metrics, respectively. Moreover, our dependency-based construction approach attains 94% of the performance of LLM-generated knowledge graphs (61.87% vs. 65.83%) while significantly reducing cost and improving scalability. These results validate the feasibility of deploying GraphRAG systems in real-world, large-scale enterprise applications without incurring prohibitive resource requirements paving the way for practical, explainable, and domain-adaptable retrieval-augmented reasoning.
Zecheng Zhao, Selena Song, Tong Chen, Zhi Chen, Shazia Sadiq, Yadan Luo
Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation
metrics primarily capture visual quality and temporal consistency, offering
limited insight into how synthetic videos perform in downstream tasks such as
text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset
and benchmark designed to evaluate the utility of synthetic videos for building
retrieval models. Based on 800 diverse user queries derived from MSRVTT
training split, we generate synthetic videos using state-of-the-art T2V models
and annotate each video-text pair along four key semantic alignment dimensions:
Object \& Scene, Action, Attribute, and Prompt Fidelity. Our evaluation
framework correlates general video quality assessment (VQA) metrics with these
alignment scores, and examines their predictive power for downstream TVR
performance. To explore pathways of scaling up, we further develop an
Auto-Evaluator to estimate alignment quality from existing metrics. Beyond
benchmarking, our results show that SynTVA is a valuable asset for dataset
augmentation, enabling the selection of high-utility synthetic samples that
measurably improve TVR outcomes. Project page and dataset can be found at
https://jasoncodemaker.github.io/SynTVA/.
Authors' comments: 7 pages, 10 figures
Chloe Ho, Ishneet Sukhvinder Singh, Diya Sharma, Tanvi Reddy Anumandla, Michael Lu, Vasu Sharma, Kevin Zhu
Search algorithms and user query relevance have given LLMs the ability to return relevant information, but the effect of content phrasing on ad visibility remains underexplored. We investigate how LLM-based rewriting of advertisements can improve their ranking in retrieval systems and inclusion in generated LLM responses, without modifying the retrieval model itself. We introduce a supervised fine-tuning framework with a custom loss balancing semantic relevance and content fidelity. To evaluate effectiveness, we propose two metrics: DeltaMRR@K (ranking improvement) and DeltaDIR@K (inclusion frequency improvement). Our approach presents a scalable method to optimize ad phrasing, enhancing visibility in retrieval-based LLM workflows. Experiments across both instruction-based and few-shot prompting demonstrate that PPO trained models outperform both prompt engineering and supervised fine-tuning in most cases, achieving up to a 2.79 DeltaDIR@5 and 0.0073 DeltaMRR@5 in instruction-based prompting. These results highlight the importance of how the ad is written before retrieval and prompt format and reinforcement learning in effective ad rewriting for LLM integrated retrieval systems.
William A. Ingram, Bipasha Banerjee, Edward A. Fox
Large language models (LLMs) are increasingly used to assign document
relevance labels in information retrieval pipelines, especially in domains
lacking human-labeled data. However, different models often disagree on
borderline cases, raising concerns about how such disagreement affects
downstream retrieval. This study examines labeling disagreement between two
open-weight LLMs, LLaMA and Qwen, on a corpus of scholarly abstracts related to
Sustainable Development Goals (SDGs) 1, 3, and 7. We isolate disagreement
subsets and examine their lexical properties, rank-order behavior, and
classification predictability. Our results show that model disagreement is
systematic, not random: disagreement cases exhibit consistent lexical patterns,
produce divergent top-ranked outputs under shared scoring functions, and are
distinguishable with AUCs above 0.74 using simple classifiers. These findings
suggest that LLM-based filtering introduces structured variability in document
retrieval, even under controlled prompting and shared ranking logic. We propose
using classification disagreement as an object of analysis in retrieval
evaluation, particularly in policy-relevant or thematic search tasks.
Authors' comments: Presented at LLM4Eval Workshop, SIGIR 2025 Padova, Italy, July 17,
2025
Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun
Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieval the most relevant reference as the draft tokens, where these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61 $\times$ speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.
Juan Chen, Baolong Bi, Wei Zhang, Jingyan Sui, Xiaofei Zhu, Yuanzhuo Wang, Lingrui Mei, Shenghua Liu
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating their parametric knowledge with external retrieved content. However, knowledge conflicts caused by internal inconsistencies or noisy retrieved content can severely undermine the generation reliability of RAG systems.In this work, we argue that LLMs should rethink all evidence, including both retrieved content and internal knowledge, before generating responses.We propose CARE-RAG (Conflict-Aware and Reliable Evidence for RAG), a novel framework that improves trustworthiness through Conflict-Driven Summarization of all available evidence.CARE-RAG first derives parameter-aware evidence by comparing parameter records to identify diverse internal perspectives. It then refines retrieved evidences to produce context-aware evidence, removing irrelevant or misleading content. To detect and summarize conflicts, we distill a 3B LLaMA3.2 model to perform conflict-driven summarization, enabling reliable synthesis across multiple sources.To further ensure evaluation integrity, we introduce a QA Repair step to correct outdated or ambiguous benchmark answers.Experiments on revised QA datasets with retrieval data show that CARE-RAG consistently outperforms strong RAG baselines, especially in scenarios with noisy or conflicting evidence.
Michał Matak, Jarosław A. Chudziak
In this paper we discuss the capability of large language models to base
their answer and provide proper references when dealing with legal matters of
non-english and non-chinese speaking country. We discuss the history of legal
information retrieval, the difference between case law and statute law, its
impact on the legal tasks and analyze the latest research in this field. Basing
on that background we introduce gAIus, the architecture of the cognitive
LLM-based agent, whose responses are based on the knowledge retrieved from
certain legal act, which is Polish Civil Code. We propose a retrieval mechanism
which is more explainable, human-friendly and achieves better results than
embedding-based approaches. To evaluate our method we create special dataset
based on single-choice questions from entrance exams for law apprenticeships
conducted in Poland. The proposed architecture critically leveraged the
abilities of used large language models, improving the gpt-3.5-turbo-0125 by
419%, allowing it to beat gpt-4o and lifting gpt-4o-mini score from 31% to 86%.
At the end of our paper we show the possible future path of research and
potential applications of our findings.
Authors' comments: 8 pages, 2 figures, presented at ICAART 2025, in proceedings of the
17th International Conference on Agents and Artificial Intelligence - Volume
3: ICAART
Yuheng Wang, Xianhe Tang, Pufeng Huang
Memes are widely used in online social interactions, providing vivid, intuitive, and often humorous means to express intentions and emotions. Existing dialogue datasets are predominantly limited to either manually annotated or pure-text conversations, lacking the expressiveness and contextual nuance that multimodal interactions provide.To address these challenges, we introduce MemeCMD, an automatically generated Chinese Multi-turn Dialogue dataset with contextually retrieved memes. Our dataset combines a large-scale, MLLM-annotated meme library with dialogues auto-generated by dual agents across diverse scenarios. We introduce a retrieval framework and adaptive threshold to ensure contextually relevant, naturally spaced meme usage. Experiments demonstrate the effectiveness of our approach in generating contextually appropriate and diverse meme-incorporated dialogues, offering a scalable and privacy-preserving resource for advancing multimodal conversational AI.
Chong Zhang, Xichao Liu, Yibing Zhan, Dapeng Tao, Jun Ni
Recent advancements in spaceborne GNSS missions have produced extensive
global datasets, providing a robust basis for deep learning-based significant
wave height (SWH) retrieval. While existing deep learning models predominantly
utilize CYGNSS data with four-channel information, they often adopt
single-channel inputs or simple channel concatenation without leveraging the
benefits of cross-channel information interaction during training. To address
this limitation, a novel spatial-channel attention-based network, namely
SCAWaveNet, is proposed for SWH retrieval. Specifically, features from each
channel of the DDMs are modeled as independent attention heads, enabling the
fusion of spatial and channel-wise information. For auxiliary parameters, a
lightweight attention mechanism is designed to assign weights along the spatial
and channel dimensions. The final feature integrates both spatial and
channel-level characteristics. Model performance is evaluated using
four-channel CYGNSS data. When ERA5 is used as a reference, SCAWaveNet achieves
an average RMSE of 0.438 m. When using buoy data from NDBC, the average RMSE
reaches 0.432 m. Compared to state-of-the-art models, SCAWaveNet reduces the
average RMSE by at least 3.52% on the ERA5 dataset and by 5.47% on the NDBC
buoy observations. The code is available at
https://github.com/Clifx9908/SCAWaveNet.
Authors' comments: 16 pages,6 tables,11 figures
Jianghao Lin, Xinyuan Wang, Xinyi Dai, Menghui Zhu, Bo Chen, Ruiming Tang, Yong Yu, Weinan Zhang
Tool retrieval is a critical component in enabling large language models (LLMs) to interact effectively with external tools. It aims to precisely filter the massive tools into a small set of candidates for the downstream tool-augmented LLMs. However, most existing approaches primarily focus on optimizing tool representations, often neglecting the importance of precise query comprehension. To address this gap, we introduce MassTool, a multi-task search-based framework designed to enhance both query representation and tool retrieval accuracy. MassTool employs a two-tower architecture: a tool usage detection tower that predicts the need for function calls, and a tool retrieval tower that leverages a query-centric graph convolution network (QC-GCN) for effective query-tool matching. It also incorporates search-based user intent modeling (SUIM) to handle diverse and out-of-distribution queries, alongside an adaptive knowledge transfer (AdaKT) module for efficient multi-task learning. By jointly optimizing tool usage detection loss, list-wise retrieval loss, and contrastive regularization loss, MassTool establishes a robust dual-step sequential decision-making pipeline for precise query understanding. Extensive experiments demonstrate its effectiveness in improving retrieval accuracy. Our code is available at https://github.com/wxydada/MassTool.
Ikechukwu Ogbonna, Lesley Davidson, Soumya Banerjee, Abhishek Dasgupta, Laurence Kenney, Vikranth Harthikote Nagaraja
Millions of people in African countries face barriers to accessing healthcare
due to language and literacy gaps. This research tackles this challenge by
transforming complex medical documents -- in this case, prosthetic device user
manuals -- into accessible formats for underserved populations. This case study
in cross-cultural translation is particularly pertinent/relevant for
communities that receive donated prosthetic devices but may not receive the
accompanying user documentation. Or, if available online, may only be available
in formats (e.g., language and readability) that are inaccessible to local
populations (e.g., English-language, high resource settings/cultural context).
The approach is demonstrated using the widely spoken Pidgin dialect, but our
open-source framework has been designed to enable rapid and easy extension to
other languages/dialects. This work presents an AI-powered framework designed
to process and translate complex medical documents, e.g., user manuals for
prosthetic devices, into marginalised languages. The system enables users --
such as healthcare workers or patients -- to upload English-language medical
equipment manuals, pose questions in their native language, and receive
accurate, localised answers in real time. Technically, the system integrates a
Retrieval-Augmented Generation (RAG) pipeline for processing and semantic
understanding of the uploaded manuals. It then employs advanced Natural
Language Processing (NLP) models for generative question-answering and
multilingual translation. Beyond simple translation, it ensures accessibility
to device instructions, treatment protocols, and safety information, empowering
patients and clinicians to make informed healthcare decisions.
Authors' comments: 5 pages, 0 figures, 0 tables
Shadman Sobhan, Mohammad Ariful Haque
Large Language Models (LLMs) are capable of natural language understanding
and generation. But they face challenges such as hallucination and outdated
knowledge. Fine-tuning is one possible solution, but it is resource-intensive
and must be repeated with every data update. Retrieval-Augmented Generation
(RAG) offers an efficient solution by allowing LLMs to access external
knowledge sources. However, traditional RAG pipelines struggle with retrieving
information from complex technical documents with structured data such as
tables and images. In this work, we propose a RAG pipeline, capable of handling
tables and images in documents, for technical documents that support both
scanned and searchable formats. Its retrieval process combines vector
similarity search with a fine-tuned reranker based on Gemma-2-9b-it. The
reranker is trained using RAFT (Retrieval-Augmented Fine-Tuning) on a custom
dataset designed to improve context identification for question answering. Our
evaluation demonstrates that the proposed pipeline achieves a high faithfulness
score of 94% (RAGas) and 96% (DeepEval), and an answer relevancy score of 87%
(RAGas) and 93% (DeepEval). Comparative analysis demonstrates that the proposed
architecture is superior to general RAG pipelines in terms of table-based
questions and handling questions outside context.
Authors' comments: 29 Pages, 11 Tables
Suofei Zhang, Xinxin Wang, Xiaofu Wu, Quan Zhou, Haifeng Hu
Existing deep learning-based cross-view geo-localization methods primarily focus on improving the accuracy of cross-domain image matching, rather than enabling models to comprehensively capture contextual information around the target and minimize the cost of localization errors. To support systematic research into this Distance-Aware Cross-View Geo-Localization (DACVGL) problem, we construct Distance-Aware Campus (DA-Campus), the first benchmark that pairs multi-view imagery with precise distance annotations across three spatial resolutions. Based on DA-Campus, we formulate DACVGL as a hierarchical retrieval problem across different domains. Our study further reveals that, due to the inherent complexity of spatial relationships among buildings, this problem can only be addressed via a contrastive learning paradigm, rather than conventional metric learning. To tackle this challenge, we propose Dynamic Contrastive Learning (DyCL), a novel framework that progressively aligns feature representations according to hierarchical spatial margins. Extensive experiments demonstrate that DyCL is highly complementary to existing multi-scale metric learning methods and yields substantial improvements in both hierarchical retrieval performance and overall cross-view geo-localization accuracy. Our code and benchmark are publicly available at https://github.com/anocodetest1/DyCL.