Yueqing Xi, Yifan Bai, Huasen Luo, Weiliang Wen, Hui Liu, Haoliang Li
As artificial intelligence permeates judicial forensics, ensuring the veracity and traceability of legal question answering (QA) has become critical. Conventional large language models (LLMs) are prone to hallucination, risking misleading guidance in legal consultation, while static knowledge bases struggle to keep pace with frequently updated statutes and case law. We present a hybrid legal QA agent tailored for judicial settings that integrates retrieval-augmented generation (RAG) with multi-model ensembling to deliver reliable, auditable, and continuously updatable counsel. The system prioritizes retrieval over generation: when a trusted legal repository yields relevant evidence, answers are produced via RAG; otherwise, multiple LLMs generate candidates that are scored by a specialized selector, with the top-ranked answer returned. High-quality outputs then undergo human review before being written back to the repository, enabling dynamic knowledge evolution and provenance tracking. Experiments on the Law\_QA dataset show that our hybrid approach significantly outperforms both a single-model baseline and a vanilla RAG pipeline on F1, ROUGE-L, and an LLM-as-a-Judge metric. Ablations confirm the complementary contributions of retrieval prioritization, model ensembling, and the human-in-the-loop update mechanism. The proposed system demonstrably reduces hallucination while improving answer quality and legal compliance, advancing the practical landing of media forensics technologies in judicial scenarios.
Hung-Shin Lee, Chen-Chi Chang, Ching-Yuan Chen, Yun-Hsiang Hsu
This study proposes a cognitive benchmarking framework to evaluate how large
language models (LLMs) process and apply culturally specific knowledge. The
framework integrates Bloom's Taxonomy with Retrieval-Augmented Generation (RAG)
to assess model performance across six hierarchical cognitive domains:
Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating.
Using a curated Taiwanese Hakka digital cultural archive as the primary
testbed, the evaluation measures LLM-generated responses' semantic accuracy and
cultural relevance.
Authors' comments: This paper has been accepted by The Electronic Library, and the full
article is now available on Emerald Insight
Xiaoyu Liu, Fuwei Zhang, Yiqing Wu, Xinyu Jia, Zenghua Xia, Fuzhen Zhuang, Zhao Zhang, Fei Jiang et al.
Generative retrieval (GR) has gained significant attention as an effective
paradigm that integrates the capabilities of large language models (LLMs). It
generally consists of two stages: constructing discrete semantic identifiers
(IDs) for documents and retrieving documents by autoregressively generating ID
tokens.The core challenge in GR is how to construct document IDs (DocIDS) with
strong representational power. Good IDs should exhibit two key properties:
similar documents should have more similar IDs, and each document should
maintain a distinct and unique ID.However, most existing methods ignore native
category information, which is common and critical in E-commerce. Therefore, we
propose a novel ID learning method, CAtegory-Tree Integrated Document
IDentifier (CAT-ID$^2$), incorporating prior category information into the
semantic IDs.CAT-ID$^2$ includes three key modules: a Hierarchical Class
Constraint Loss to integrate category information layer by layer during
quantization, a Cluster Scale Constraint Loss for uniform ID token
distribution, and a Dispersion Loss to improve the distinction of reconstructed
documents. These components enable CAT-ID$^2$ to generate IDs that make similar
documents more alike while preserving the uniqueness of different documents'
representations.Extensive offline and online experiments confirm the
effectiveness of our method, with online A/B tests showing a 0.33% increase in
average orders per thousand users for ambiguous intent queries and 0.24% for
long-tail queries.
Authors' comments: Accepted by WSDM'26
Heng Zhou, Ao Yu, Yuchen Fan, Jianing Shi, Li Kang, Hejia Geng, Yongting Zhang, Yutao Fan et al.
Evaluating large language models (LLMs) on question answering often relies on static benchmarks that reward memorization and understate the role of retrieval, failing to capture the dynamic nature of world knowledge. We present LiveSearchBench, an automated pipeline for constructing retrieval-dependent benchmarks from recent knowledge updates. Our method computes deltas between successive Wikidata snapshots, filters candidate triples for quality, and synthesizes natural-language questions at three levels of reasoning difficulty, each guaranteed to admit a unique, verifiable answer through SPARQL validation. The pipeline is fully automated, scalable across time, and minimizes human intervention, enabling continual regeneration of temporally grounded benchmarks. Experiments show a pronounced performance drop when models confront facts that post-date pretraining, with the gap most salient on multi-hop queries. Retrieval augmented methods and larger, instruction-tuned models provide partial gains but fail to close this recency gap. By design, LiveSearchBench shifts evaluation from static memorization toward tasks that require up-to-date retrieval and reasoning, offering a foundation for systematic, long-term assessment of LLMs under evolving knowledge.
Muhammed Yusuf Kartal, Suha Kagan Kose, Korhan Sevinç, Burak Aktas
Retrieval-Augmented Generation (RAG) quality depends on many interacting
choices across retrieval, ranking, augmentation, prompting, and generation, so
optimizing modules in isolation is brittle. We introduce RAGSmith, a modular
framework that treats RAG design as an end-to-end architecture search over nine
technique families and 46{,}080 feasible pipeline configurations. A genetic
search optimizes a scalar objective that jointly aggregates retrieval metrics
(recall@k, mAP, nDCG, MRR) and generation metrics (LLM-Judge and semantic
similarity). We evaluate on six Wikipedia-derived domains (Mathematics, Law,
Finance, Medicine, Defense Industry, Computer Science), each with 100 questions
spanning factual, interpretation, and long-answer types. RAGSmith finds
configurations that consistently outperform naive RAG baseline by +3.8\% on
average (range +1.2\% to +6.9\% across domains), with gains up to +12.5\% in
retrieval and +7.5\% in generation. The search typically explores $\approx
0.2\%$ of the space ($\sim 100$ candidates) and discovers a robust backbone --
vector retrieval plus post-generation reflection/revision -- augmented by
domain-dependent choices in expansion, reranking, augmentation, and prompt
reordering; passage compression is never selected. Improvement magnitude
correlates with question type, with larger gains on factual/long-answer mixes
than interpretation-heavy sets. These results provide practical, domain-aware
guidance for assembling effective RAG systems and demonstrate the utility of
evolutionary search for full-pipeline optimization.
Authors' comments: 45 pages
Lvhua Wu, Xuefeng Jiang, Sheng Sun, Tian Wen, Yuwei Wang, Min Liu
The rapid spread of fake news threatens social stability and public trust, rendering its detection an imperative research priority. Although large language models (LLMs) excel at numerous natural language processing tasks with their remarkable contextual understanding and extensive prior knowledge, the time-bounded knowledge coverage and tendency for generating hallucination content reduce their reliability when handling fast-evolving news streams. Furthermore, models trained on existing static datasets also often lack the generalization needed for emerging news topics. To address these challenges, we propose ZoFia, a novel two-stage zero-shot fake news detection framework. First, we introduce Hierarchical Salience to quantify the importance of entities in the news content, and propose the SC-MMR algorithm to effectively select an informative and diverse set of keywords that serve as queries for retrieving up-to-date external evidence. Subsequently, a multi LLM interactive system, in which each agent assumes a distinct role, performs multi-view collaborative analysis and adversarial debate over the news text and its related information, and finally produces an interpretable and robust judgment. Comprehensive experiments on two public datasets demonstrate that ZoFia obviously outperforms existing zero-shot baselines and most of few-shot methods. Our codes will be open-sourced to facilitate related communities.
Cuong Van Duc, Thai Tran Quoc, Minh Nguyen Dinh Tuan, Tam Vu Duc, Son Nguyen Van, Hanh Nguyen Thi
Detecting student misconceptions in open-ended responses is a longstanding challenge, demanding semantic precision and logical reasoning. We propose MiRAGE - Misconception Detection with Retrieval-Guided Multi-Stage Reasoning and Ensemble Fusion, a novel framework for automated misconception detection in mathematics. MiRAGE operates in three stages: (1) a Retrieval module narrows a large candidate pool to a semantically relevant subset; (2) a Reasoning module employs chain-of-thought generation to expose logical inconsistencies in student solutions; and (3) a Reranking module refines predictions by aligning them with the reasoning. These components are unified through an ensemble-fusion strategy that enhances robustness and interpretability. On mathematics datasets, MiRAGE achieves Mean Average Precision scores of 0.82/0.92/0.93 at levels 1/3/5, consistently outperforming individual modules. By coupling retrieval guidance with multi-stage reasoning, MiRAGE reduces dependence on large-scale language models while delivering a scalable and effective solution for educational assessment.
Junqi Zhao, Chenxing Li, Jinzheng Zhao, Rilin Chen, Dong Yu, Mark D. Plumbley, Wenwu Wang
We propose a general feedback-driven retrieval-augmented generation (RAG) approach that leverages Large Audio Language Models (LALMs) to address the missing or imperfect synthesis of specific sound events in text-to-audio (TTA) generation. Unlike previous RAG-based TTA methods that typically train specialized models from scratch, we utilize LALMs to analyze audio generation outputs, retrieve concepts that pre-trained models struggle to generate from an external database, and incorporate the retrieved information into the generation process. Experimental results show that our method not only enhances the ability of LALMs to identify missing sound events but also delivers improvements across different models, outperforming existing RAG-specialized approaches.
Hanwen Su, Ge Song, Jiyan Wang, Yuanbo Zhu
The problem of zero-shot sketch-based image retrieval (ZS-SBIR) has achieved increasing attention due to its wide applications, e.g. e-commerce. Despite progress made in this field, previous works suffer from using imbalanced samples of modalities and inconsistent low-quality information during training, resulting in sub-optimal performance. Therefore, in this paper, we introduce an approach called Dynamic Multi-level Weighted Alignment Network for ZS-SBIR. It consists of three components: (i) a Uni-modal Feature Extraction Module that includes a CLIP text encoder and a ViT for extracting textual and visual tokens, (ii) a Cross-modal Multi-level Weighting Module that produces an alignment weight list by the local and global aggregation blocks to measure the aligning quality of sketch and image samples, (iii) a Weighted Quadruplet Loss Module aiming to improve the balance of domains in the triplet loss. Experiments on three benchmark datasets, i.e., Sketchy, TU-Berlin, and QuickDraw, show our method delivers superior performances over the state-of-the-art ZS-SBIR methods.
Ahmed Masry, Megh Thakkar, Patrice Bechard, Sathwik Tejaswi Madhusudhan, Rabiul Awal, Shambhavi Mishra, Akshay Kalkunte Suresh, Srivatsava Daruru et al.
Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism more relevant to multimodal document structures and visual characteristics. ColMate obtains 3.61% improvements over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.
Yilan Liu
Clinical vignettes are essential educational tools in speech-language pathology (SLP), but manual creation is time-intensive. While general-purpose large language models (LLMs) can generate text, they lack domain-specific knowledge, leading to hallucinations and requiring extensive expert revision. This study presents a proof-of-concept system integrating retrieval-augmented generation (RAG) with curated knowledge bases to generate pediatric SLP case materials. A multi-model RAG-based system was prototyped integrating curated domain knowledge with engineered prompt templates, supporting five commercial (GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 Pro) and open-source (Llama 3.2, Qwen 2.5-7B) LLMs. Seven test scenarios spanning diverse disorder types and grade levels were systematically designed. Generated cases underwent automated quality assessment using a multi-dimensional rubric evaluating structural completeness, internal consistency, clinical appropriateness, and IEP goal/session note quality. This proof-of-concept demonstrates technical feasibility for RAG-augmented generation of pediatric SLP vignettes. Commercial models showed marginal quality advantages, but open-source alternatives achieved acceptable performance, suggesting potential for privacy-preserving institutional deployment. Integration of curated knowledge bases enabled content generation aligned with professional guidelines. Extensive validation through expert review, student pilot testing, and psychometric evaluation is required before educational or research implementation. Future applications may extend to clinical decision support, automated IEP goal generation, and clinical reflection training.
Authors' comments: 37 pages, 3 figures
Song Wang, Zihan Chen, Peng Wang, Zhepei Wei, Zhen Tan, Yu Meng, Cong Shen, Jundong Li
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by
integrating external knowledge sources to address their limitations in
accessing up-to-date or specialized information. A natural strategy to increase
the likelihood of retrieving relevant information is to expand the number of
retrieved documents. However, involving more documents could introduce
significant noise, as many documents may be irrelevant or misleading, thereby
reducing the overall accuracy of the generated responses. To overcome the
challenge associated with handling a larger number of documents, we propose
WinnowRAG, a novel RAG framework designed to systematically filter out noisy
documents while preserving valuable content -- a process we refer to as
winnowing. WinnowRAG operates in two stages: In Stage I, we perform query-aware
clustering to group similar documents and form distinct topic clusters. Each
cluster is assigned to an LLM agent for generating a unique answer. In Stage
II, we perform winnowing, wherein a critic LLM evaluates the outputs of
multiple agents and iteratively separates useful documents from noisy ones. To
retain useful documents when discarding agents, we propose two strategic
merging techniques to ensure that only relevant knowledge is used for
generating the final response. Crucially, WinnowRAG is model-agnostic and does
not require any model fine-tuning, making it easily adaptable to various tasks.
Extensive experiments on various realistic datasets demonstrate the
effectiveness of WinnowRAG over state-of-the-art baselines.
Authors' comments: EMNLP Main 2025
Benjamin Clavié, Xianming Li, Antoine Chaffin, Omar Khattab, Tom Aarsen, Manuel Faysse, Jing Li
Late interaction retrieval methods, pioneered by ColBERT, have emerged as a
powerful alternative to single-vector neural IR. By leveraging fine-grained,
token-level representations, they have been demonstrated to deliver strong
generalisation and robustness, particularly in out-of-domain settings. They
have recently been shown to be particularly well-suited for novel use cases,
such as reasoning-based or cross-modality retrieval. At the same time, these
models pose significant challenges of efficiency, usability, and integrations
into fully fledged systems; as well as the natural difficulties encountered
while researching novel application domains. Recent years have seen rapid
advances across many of these areas, but research efforts remain fragmented
across communities and frequently exclude practitioners. The purpose of this
workshop is to create an environment where all aspects of late interaction can
be discussed, with a focus on early research explorations, real-world outcomes,
and negative or puzzling results to be freely shared and discussed. The aim of
LIR is to provide a highly-interactive environment for researchers from various
backgrounds and practitioners to freely discuss their experience, fostering
further collaboration.
Authors' comments: Accepted workshop at ECIR 2026
Chaochen Wu, Guan Luo, Meiyun Zuo, Zhitao Fan
Video moment retrieval uses a text query to locate a moment from a given untrimmed video reference. Locating corresponding video moments with text queries helps people interact with videos efficiently. Current solutions for this task have not considered conflict within location results from different models, so various models cannot integrate correctly to produce better results. This study introduces a reinforcement learning-based video moment retrieval model that can scan the whole video once to find the moment's boundary while producing its locational evidence. Moreover, we proposed a multi-agent system framework that can use evidential learning to resolve conflicts between agents' localization output. As a side product of observing and dealing with conflicts between agents, we can decide whether a query has no corresponding moment in a video (out-of-scope) without additional training, which is suitable for real-world applications. Extensive experiments on benchmark datasets show the effectiveness of our proposed methods compared with state-of-the-art approaches. Furthermore, the results of our study reveal that modeling competition and conflict of the multi-agent system is an effective way to improve RL performance in moment retrieval and show the new role of evidential learning in the multi-agent framework.
Arman Anwar, Zefang Liu
Traditional cybersecurity tabletop exercises (TTXs) provide valuable training but are often scripted, resource-intensive, and difficult to scale. We introduce AgentBnB, a browser-based re-imagining of the Backdoors & Breaches game that integrates large language model teammates with a Bloom-aligned, retrieval-augmented copilot (C2D2). The system expands a curated corpus into factual, conceptual, procedural, and metacognitive snippets, delivering on-demand, cognitively targeted hints. Prompt-engineered agents employ a scaffolding ladder that gradually fades as learner confidence grows. In a solo-player pilot with four graduate students, participants reported greater intention to use the agent-based version compared to the physical card deck and viewed it as more scalable, though a ceiling effect emerged on a simple knowledge quiz. Despite limitations of small sample size, single-player focus, and narrow corpus, these early findings suggest that large language model augmented TTXs can provide lightweight, repeatable practice without the logistical burden of traditional exercises. Planned extensions include multi-player modes, telemetry-driven coaching, and comparative studies with larger cohorts.
Xiaocong Yang
Modern dense information retrieval (IR) models usually rely on costly large-scale pretraining. In this paper, we introduce LLM2IR, an efficient unsupervised contrastive learning framework to convert any decoder-only large language model (LLM) to an information retrieval model. Despite its simplicity, the effectiveness is proven among different LLMs on multiple IR benchmarks including LoCo, LongEmbed and BEIR. We also find that models with a longer context length tend to have a stronger IR capacity by comparing task performances of models in the same model family. Our work not only provides an effective way to build IR models on the state-of-the-art LLMs, but also shed light on the relationship between information retrieval ability and model context length, which helps the design of better information retrievers.
Authors' comments: MS Thesis
Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Xiaowen Chu
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
Yulong Hui, Chao Chen, Zhihang Fu, Yihao Liu, Jieping Ye, Huanchen Zhang
Retrieval-Augmented Generation (RAG) has significantly enhanced LLMs by incorporating external information. However, prevailing agentic RAG approaches are constrained by a critical limitation: they treat the retrieval process as a black-box querying operation. This confines agents' actions to query issuing, hindering its ability to tackle complex information-seeking tasks. To address this, we introduce Interact-RAG, a new paradigm that elevates the LLM agent from a passive query issuer into an active manipulator of the retrieval process. We dismantle the black-box with a Corpus Interaction Engine, equipping the agent with a set of action primitives for fine-grained control over information retrieval. To further empower the agent on the entire RAG pipeline, we first develop a reasoning-enhanced workflow, which enables both zero-shot execution and the synthesis of interaction trajectories. We then leverage this synthetic data to train a fully autonomous end-to-end agent via Supervised Fine-Tuning (SFT), followed by refinement with Reinforcement Learning (RL). Extensive experiments across six benchmarks demonstrate that Interact-RAG significantly outperforms other advanced methods, validating the efficacy of our reasoning-interaction strategy.
Mohammad Zahangir Alam, Khandoker Ashik Uz Zaman, Mahdi H. Miraz
Retrieval-Augmented Generation (RAG) shows significant promise in knowledge-intensive tasks by improving domain specificity, enhancing temporal relevance, and reducing hallucinations. However, applying RAG to finance encounters critical challenges: restricted access to proprietary datasets, limited retrieval accuracy, regulatory constraints, and sensitive data interpretation. We introduce AstuteRAG-FQA, an adaptive RAG framework tailored for Financial Question Answering (FQA), leveraging task-aware prompt engineering to address these challenges. The framework uses a hybrid retrieval strategy integrating both open-source and proprietary financial data while maintaining strict security protocols and regulatory compliance. A dynamic prompt framework adapts in real time to query complexity, improving precision and contextual relevance. To systematically address diverse financial queries, we propose a four-tier task classification: explicit factual, implicit factual, interpretable rationale, and hidden rationale involving implicit causal reasoning. For each category, we identify key challenges, datasets, and optimization techniques within the retrieval and generation process. The framework incorporates multi-layered security mechanisms including differential privacy, data anonymization, and role-based access controls to protect sensitive financial information. Additionally, AstuteRAG-FQA implements real-time compliance monitoring through automated regulatory validation systems that verify responses against industry standards and legal obligations. We evaluate three data integration techniques - contextual embedding, small model augmentation, and targeted fine-tuning - analyzing their efficiency and feasibility across varied financial environments.
Chuxuan Hu, Maxwell Yang, James Weiland, Yeji Lim, Suhas Palawala, Daniel Kang
Manually conducting real-world data analyses is labor-intensive and
inefficient. Despite numerous attempts to automate data science workflows, none
of the existing paradigms or systems fully demonstrate all three key
capabilities required to support them effectively: (1) open-domain data
collection, (2) structured data transformation, and (3) analytic reasoning.
To overcome these limitations, we propose DRAMA, an end-to-end paradigm that
answers users' analytic queries in natural language on large-scale open-domain
data. DRAMA unifies data collection, transformation, and analysis as a single
pipeline. To quantitatively evaluate system performance on tasks representative
of DRAMA, we construct a benchmark, DRAMA-Bench, consisting of two categories
of tasks: claim verification and question answering, each comprising 100
instances. These tasks are derived from real-world applications that have
gained significant public attention and require the retrieval and analysis of
open-domain data. We develop DRAMA-Bot, a multi-agent system designed following
DRAMA. It comprises a data retriever that collects and transforms data by
coordinating the execution of sub-agents, and a data analyzer that performs
structured reasoning over the retrieved data. We evaluate DRAMA-Bot on
DRAMA-Bench together with five state-of-the-art baseline agents. DRAMA-Bot
achieves 86.5% task accuracy at a cost of $0.05, outperforming all baselines
with up to 6.9 times the accuracy and less than 1/6 of the cost. DRAMA is
publicly available at https://github.com/uiuc-kang-lab/drama.
Authors' comments: Accepted to SIGMOD 2026