Zehua Pei, Ying Zhang, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs. However, the significant memory demands of large MoE models hinder their deployment across various computational environments, from cloud servers to consumer devices. This study first demonstrates pronounced task-specific specialization in expert activation patterns within MoE layers. Building on this, we introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments. PreMoe features two main components: probabilistic expert pruning (PEP) and task-adaptive expert retrieval (TAER). PEP employs a new metric, the task-conditioned expected selection score (TCESS), derived from router logits to quantify expert importance for specific tasks, thereby identifying a minimal set of critical experts. TAER leverages these task-specific expert importance profiles for efficient inference. It pre-computes and stores compact expert patterns for diverse tasks. When a user query is received, TAER rapidly identifies the most relevant stored task pattern and reconstructs the model by loading only the small subset of experts crucial for that task. This approach dramatically reduces the memory footprint across all deployment scenarios. DeepSeek-R1 671B maintains 97.2\% accuracy on MATH500 when pruned to an 8/128 configuration (50\% expert reduction), and still achieves 72.0\% with aggressive 8/32 pruning (87.5\% expert reduction). Pangu-Ultra-MoE 718B achieves 97.15\% on MATH500 and 81.3\% on AIME24 with 8/128 pruning, while even more aggressive pruning to 4/64 (390GB memory) preserves 96.95\% accuracy on MATH500. We make our code publicly available at https://github.com/JarvisPei/PreMoe.
Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin
Training robust retrieval and reranker models typically relies on large-scale
retrieval datasets; for example, the BGE collection contains 1.6 million
query-passage pairs sourced from various data sources. However, we find that
certain datasets can negatively impact model effectiveness -- pruning 8 out of
15 datasets from the BGE collection reduces the training set size by
2.35$\times$ and increases nDCG@10 on BEIR by 1.0 point. This motivates a
deeper examination of training data quality, with a particular focus on "false
negatives", where relevant passages are incorrectly labeled as irrelevant. We
propose a simple, cost-effective approach using cascading LLM prompts to
identify and relabel hard negatives. Experimental results show that relabeling
false negatives with true positives improves both E5 (base) and Qwen2.5-7B
retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot
AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on
the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the
cascading design is further supported by human annotation results, where we
find judgment by GPT-4o shows much higher agreement with humans than
GPT-4o-mini.
Authors' comments: Code is available at https://github.com/castorini/rlhn & datasets are
available at https://huggingface.co/rlhn
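The nDCG@10 figures quoted in the abstract above refer to a standard ranking metric. As a minimal sketch (not the authors' evaluation code), it can be computed as follows, assuming graded relevance labels and the linear-gain form of DCG; the function names and the toy labels are illustrative only.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal DCG.

    ranked_relevances: labels of the retrieved passages, in ranked order.
    judged_relevances: labels of all judged passages for the query,
                       used to build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Toy example: the single relevant passage is retrieved at rank 2.
score = ndcg_at_k([0, 1, 0], [1])
```

Here `score` equals `1 / log2(3)`, since the only relevant passage sits one position below the ideal rank.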
Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, Yixing Fan
Retrieval-augmented generation (RAG) systems can effectively mitigate the
hallucination problem of large language models (LLMs), but they also possess
inherent vulnerabilities. Identifying these weaknesses before the large-scale
real-world deployment of RAG systems is of great importance, as it lays the
foundation for building more secure and robust RAG systems in the future.
Existing adversarial attack methods typically exploit knowledge base poisoning
to probe the vulnerabilities of RAG systems, which can effectively deceive
standard RAG models. However, with the rapid advancement of deep reasoning
capabilities in modern LLMs, previous approaches that merely inject incorrect
knowledge are inadequate when attacking RAG systems equipped with deep
reasoning abilities. Inspired by the deep thinking capabilities of LLMs, this
paper extracts reasoning process templates from R1-based RAG systems, uses
these templates to wrap erroneous knowledge into adversarial documents, and
injects them into the knowledge base to attack RAG systems. The key idea of our
approach is that adversarial documents, by simulating the chain-of-thought
patterns aligned with the model's training signals, may be misinterpreted by
the model as authentic historical reasoning processes, thus increasing their
likelihood of being referenced. Experiments conducted on the MS MARCO passage
ranking dataset demonstrate the effectiveness of our proposed method.
Authors' comments: 7 pages, 3 figures
Derong Xu, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Maolin Wang, Qidong Liu, Xiangyu Zhao, Yichao Wang et al.
Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. Building on this foundation, graph-based RAG systems go a step further by retrieving subgraphs, which preserve the relationships between knowledge entities and provide more comprehensive context. However, graph RAG faces two challenges: (1) Retrieving relevant information introduces irrelevant nodes (especially in dense graph databases, where retrieval usually extends to adjacent nodes), and leads to overly lengthy inputs that hinder efficiency; (2) The representation gap between graph and language during generation with LLMs limits the ability to fully leverage graph structures for enhanced understanding. To address these limitations, we propose Align-GRAG, a novel reasoning-guided dual alignment framework in the post-retrieval phase. It first formulates a subgraph by retrieving nodes and edges. Then an Aligner is proposed to jointly optimize a graph encoder with LLM-summarized reasoning. It achieves dual alignment of graph nodes and representations by leveraging KL divergence loss and contrastive loss, facilitating efficient pruning of irrelevant knowledge and establishing a unified semantic space. The Generator integrates the aligned graph data with the LLM to produce coherent and accurate answers. Experiments on the GraphQA benchmark across three tasks (including common sense reasoning, scene graph understanding, and knowledge graph reasoning) validate the effectiveness of our method. The code will be made available upon acceptance.
Amber Young, Tyler Robinson, Joshua Krissansen-Totton, Edward Schwieterman, Giada Arney, Gerrick Lindberg, Cristina Thomas
Robust exoplanet characterization studies are underway, and the community is
looking ahead toward developing observational strategies to search for life
beyond our solar system. With the development of life detection approaches like
searching for atmospheric chemical species indicative of life, chemical
disequilibrium has also been proposed as a potentially key signature for life.
Chemical disequilibrium can arise from the production of waste gases due to
biological processes and can be quantified using a metric known as the
available Gibbs free energy. The main goal of this study was to explore the
detectability of chemical disequilibrium for a modern Earth-like analog.
Atmospheric retrievals coupled to a thermodynamics model were used to determine
posterior distributions for the available Gibbs free energy given simulated
observations at various noise levels. In reflected light, chemical
disequilibrium signals were difficult to detect and limited by the constraints
on the CH4 abundance, which was challenging to constrain for a modern Earth
case with simulated observations spanning ultraviolet through near-infrared
wavelengths with V-band SNRs of 10, 20, and 40. For a modern Earth analog
orbiting a late-type M dwarf, we simulated transit observations with the James
Webb Space Telescope Mid-Infrared Instrument (MIRI) and found that tight
constraints on the available Gibbs free energy can be achieved, but only at
extremely low noise on the order of several ppm. This study serves as further
proof of concept for remotely inferring chemical disequilibrium biosignatures
and should be included in continuing to build life detection strategies for
future exoplanet characterization missions.
Authors' comments: 25 pages, 17 figures. Accepted to ApJ. Accessible figure data:
https://doi.org/10.5281/zenodo.15485517 Python Thermodynamics Model:
https://github.com/Bellatrix12/Python_Equilibrium_Code.git
Clayton Cohn, Surya Rayala, Caitlin Snyder, Joyce Fonteles, Shruti Jain, Naveeduddin Mohammed, Umesh Timalsina, Sarah K. Burriss et al.
Collaborative dialogue offers rich insights into students' learning and
critical thinking, which is essential for personalizing pedagogical agent
interactions in STEM+C settings. While large language models (LLMs) facilitate
dynamic pedagogical interactions, hallucinations undermine confidence, trust,
and instructional value. Retrieval-augmented generation (RAG) grounds LLM
outputs in curated knowledge but requires a clear semantic link between user
input and a knowledge base, which is often weak in student dialogue. We propose
log-contextualized RAG (LC-RAG), which enhances RAG retrieval by using
environment logs to contextualize collaborative discourse. Our findings show
that LC-RAG improves retrieval over a discourse-only baseline and allows our
collaborative peer agent, Copa, to deliver relevant, personalized guidance that
supports students' critical thinking and epistemic decision-making in a
collaborative computational modeling environment, C2STEM.
Authors' comments: To appear in the International Conference on Artificial Intelligence
in Education (AIED25) Workshop on Epistemics and Decision-Making in
AI-Supported Education
Yuelyu Ji, Rui Meng, Zhuochun Li, Daqing He
Retrieval-augmented generation (RAG) grounds large language models (LLMs) in up-to-date external evidence, yet existing multi-hop RAG pipelines still issue redundant subqueries, explore too shallowly, or wander through overly long search chains. We introduce EVO-RAG, a curriculum-guided reinforcement learning framework that evolves a query-rewriting agent from broad early-stage exploration to concise late-stage refinement. EVO-RAG couples a seven-factor, step-level reward vector (covering relevance, redundancy, efficiency, and answer correctness) with a time-varying scheduler that reweights these signals as the episode unfolds. The agent is trained with Direct Preference Optimization over a multi-head reward model, enabling it to learn when to search, backtrack, answer, or refuse. Across four multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle), EVO-RAG boosts Exact Match by up to 4.6 points over strong RAG baselines while trimming average retrieval depth by 15%. Ablation studies confirm the complementary roles of curriculum staging and dynamic reward scheduling. EVO-RAG thus offers a general recipe for building reliable, cost-effective multi-hop RAG systems.
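The abstract above does not specify the form of the time-varying scheduler. Purely as an illustration of the idea of reweighting step-level reward signals over an episode, the sketch below assumes a linear interpolation between a hypothetical "early" and "late" weight profile. The factor names are taken from the abstract; the weights, the interpolation rule, and the function names are assumptions, not EVO-RAG's actual design.

```python
def scheduled_weights(early, late, step, max_steps):
    """Linearly interpolate per-factor weights from `early` to `late`.

    Assumption: progress through the episode is step / max_steps,
    clamped to [0, 1]. The real scheduler may use a different rule.
    """
    progress = 1.0 if max_steps <= 0 else min(max(step / max_steps, 0.0), 1.0)
    return {name: (1 - progress) * early[name] + progress * late[name]
            for name in early}

def scalar_reward(signals, weights):
    """Collapse a step-level reward vector into one number."""
    return sum(weights[name] * signals[name] for name in weights)

# Hypothetical profiles: reward exploration early, conciseness late.
early = {"relevance": 1.0, "redundancy": 0.0, "efficiency": 0.0, "correctness": 0.5}
late = {"relevance": 0.5, "redundancy": 1.0, "efficiency": 1.0, "correctness": 1.0}

weights_mid = scheduled_weights(early, late, step=5, max_steps=10)
```

At the midpoint of the episode each weight is the average of its early and late values, so the same step-level signals are scored differently depending on when they occur.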
Jiehan Cheng, Zhicheng Dou
We propose DailyQA, an automatically updated dynamic dataset that updates questions weekly and contains answers to questions on any given date. DailyQA utilizes daily updates from Wikipedia revision logs to implement a fully automated pipeline of data filtering, query generation synthesis, quality checking, answer extraction, and query classification. The benchmark requires large language models (LLMs) to process and answer questions involving fast-changing factual data and covering multiple domains. We evaluate several open-source and closed-source LLMs using different RAG pipelines with web search augmentation. We compare the ability of different models to process time-sensitive web information and find that reranking web retrieval results is critical. Our results indicate that LLMs still face significant challenges in handling frequently updated information, suggesting that DailyQA benchmarking provides valuable insights into the direction of progress for LLMs and RAG systems.
Taiye Chen, Zeming Wei, Ang Li, Yisen Wang
Large Language Models (LLMs) are known to be vulnerable to jailbreaking attacks, wherein adversaries exploit carefully engineered prompts to induce harmful or unethical responses. Such threats have raised critical concerns about the safety and reliability of LLMs in real-world deployment. While existing defense mechanisms partially mitigate such risks, subsequent advancements in adversarial techniques have enabled novel jailbreaking methods to circumvent these protections, exposing the limitations of static defense frameworks. In this work, we explore defending against evolving jailbreaking threats through the lens of context retrieval. First, we conduct a preliminary study demonstrating that even a minimal set of safety-aligned examples against a particular jailbreak can significantly enhance robustness against this attack pattern. Building on this insight, we further leverage the retrieval-augmented generation (RAG) techniques and propose Safety Context Retrieval (SCR), a scalable and robust safeguarding paradigm for LLMs against jailbreaking. Our comprehensive experiments demonstrate how SCR achieves superior defensive performance against both established and emerging jailbreaking tactics, contributing a new paradigm to LLM safety. Our code will be available upon publication.
Yutao Zhu, Jiajie Jin, Hongjin Qian, Zheng Liu, Zhicheng Dou, Ji-Rong Wen
Existing studies have optimized retrieval-augmented generation (RAG) across various sub-tasks, such as query understanding and retrieval refinement, but integrating these optimizations into a unified framework remains challenging. To tackle this problem, this work proposes RoleRAG, a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization. RoleRAG comprises six modules, each handling a specific sub-task within the RAG process. Additionally, we introduce a query graph to represent the decomposition of the query, which can be dynamically resolved according to the decomposing state. All modules are driven by the same underlying LLM, distinguished by task-specific role tokens that are individually optimized. This design allows RoleRAG to dynamically activate different modules within a single LLM instance, thereby streamlining deployment and reducing resource consumption. Experimental results on five open-domain question-answering datasets demonstrate the effectiveness, generalizability, and flexibility of our framework.
Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, Jieru Zhao
Recent developments in Video Large Language Models (Video LLMs) have enabled models to process long video sequences and demonstrate remarkable performance. Nonetheless, studies predominantly focus on offline video question answering, neglecting memory usage and response speed that are essential in various real-world applications, such as Deepseek services, autonomous driving, and robotics. To mitigate these challenges, we propose $\textbf{LiveVLM}$, a training-free framework specifically designed for streaming, online video understanding and real-time interaction. Unlike existing works that process videos only after a question is posed, LiveVLM constructs an innovative streaming-oriented KV cache to process video streams in real time, retain long-term video details, and eliminate redundant KVs, ensuring prompt responses to user queries. For continuous video streams, LiveVLM generates and compresses video key-value tensors (video KVs) to preserve visual information while improving memory efficiency. Furthermore, when a new question is posed, LiveVLM incorporates an online question-answering process that efficiently fetches both short-term and long-term visual information, while minimizing interference from redundant context. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to process 44$\times$ as many frames on the same device, and achieves up to a 5$\times$ speedup in response speed compared with SoTA online methods at an input of 256 frames, while maintaining the same or better model performance.
Xinbang Dai, Huikang Hu, Yuncheng Hua, Jiaqi Li, Yongrui Chen, Rihui Jin, Nan Hu, Guilin Qi
Retrieval-augmented generation (RAG) systems face critical challenges in
balancing internal (parametric) and external (retrieved) knowledge, especially
when these sources conflict or are unreliable. To analyze these scenarios
comprehensively, we construct the Trustworthiness Response Dataset (TRD) with
36,266 questions spanning four RAG settings. We reveal that existing approaches
address isolated scenarios -- prioritizing one knowledge source, naively merging
both, or refusing answers -- but lack a unified framework to handle different
real-world conditions simultaneously. Therefore, we propose the BRIDGE
framework, which dynamically determines a comprehensive response strategy of
large language models (LLMs). BRIDGE leverages an adaptive weighting mechanism
named soft bias to guide knowledge collection, followed by a Maximum Soft-bias
Decision Tree to evaluate knowledge and select optimal response strategies
(trust internal/external knowledge, or refuse). Experiments show BRIDGE
outperforms baselines by 5-15% in accuracy while maintaining balanced
performance across all scenarios. Our work provides an effective solution for
LLMs' trustworthy responses in real-world RAG applications.
Authors' comments: 24 pages, 8 figures
Yuyang Dong, Nobuhiro Ueda, Krisztián Boros, Daiki Ito, Takuya Sera, Masafumi Oyamada
With the increasing adoption of Large Language Models (LLMs) and
Vision-Language Models (VLMs), rich document analysis technologies for
applications like Retrieval-Augmented Generation (RAG) and visual RAG are
gaining significant attention. Recent research indicates that using VLMs can
achieve better RAG performance, but processing rich documents still remains a
challenge since a single page contains large amounts of information. In this
paper, we present SCAN (\textbf{S}emanti\textbf{C} Document Layout
\textbf{AN}alysis), a novel approach enhancing both textual and visual
Retrieval-Augmented Generation (RAG) systems working with visually rich
documents. It is a VLM-friendly approach that identifies document components
with appropriate semantic granularity, balancing context preservation with
processing efficiency. SCAN uses a coarse-grained semantic approach that
divides documents into coherent regions covering continuous components. We
trained the SCAN model by fine-tuning object detection models with
sophisticated annotation datasets. Our experimental results across English and
Japanese datasets demonstrate that applying SCAN improves end-to-end textual
RAG performance by up to 9.0\% and visual RAG performance by up to 6.4\%,
outperforming conventional approaches and even commercial document processing
solutions.
Authors' comments: v1
Ehsan Doostmohammadi, Marco Kuhlmann
Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between query and retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query--context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but substantially improves test-time perplexity and accelerates model learning above a critical threshold. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context can enhance data efficiency and reduce training time by approximately 40\% without compromising performance. We specifically generate synthetic context through paraphrasing queries. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.
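The abstract above does not define how query--context overlap is measured. As one simple possibility, assuming whitespace tokenization and unigram matching (both are assumptions, not the paper's definition), overlap can be expressed as the fraction of query tokens that also appear in the retrieved context:

```python
def token_overlap(query, context):
    """Fraction of query tokens that also occur in the context.

    Assumption: lowercase whitespace tokenization and unigram matching.
    The paper may use a different tokenizer or overlap measure.
    """
    query_tokens = query.lower().split()
    if not query_tokens:
        return 0.0
    context_vocab = set(context.lower().split())
    matched = sum(1 for token in query_tokens if token in context_vocab)
    return matched / len(query_tokens)

# A paraphrase-style synthetic context shares most tokens with the query.
overlap = token_overlap("the cat sat", "a cat sat down")
```

In this toy case two of the three query tokens appear in the context, so `overlap` is 2/3.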
Jiankun Zhang, Shenglai Zeng, Jie Ren, Tianqi Zheng, Hui Liu, Xianfeng Tang, Hui Liu, Yi Chang
Multimodal Retrieval-Augmented Generation (MRAG) systems enhance LMMs by integrating external multimodal databases, but introduce unexplored privacy vulnerabilities. While text-based RAG privacy risks have been studied, multimodal data presents unique challenges. We provide the first systematic analysis of MRAG privacy vulnerabilities across vision-language and speech-language modalities. Using a novel compositional structured prompt attack in a black-box setting, we demonstrate how attackers can extract private information by manipulating queries. Our experiments reveal that LMMs can both directly generate outputs resembling retrieved content and produce descriptions that indirectly expose sensitive information, highlighting the urgent need for robust privacy-preserving MRAG techniques.
Faeze Ghorbanpour, Daryna Dementieva, Alexander Fraser
Detecting hateful language is important, but labeled hate speech data is expensive and time-consuming to collect, particularly for low-resource languages. Prior work has demonstrated the effectiveness of cross-lingual transfer learning and data augmentation in improving performance on tasks with limited labeled data. To develop an efficient and scalable cross-lingual transfer learning approach, we leverage nearest-neighbor retrieval to augment minimal labeled data in the target language, thereby enhancing detection performance. Specifically, we assume access to a small set of labeled training instances in the target language and use these to retrieve the most relevant labeled examples from a large multilingual hate speech detection pool. We evaluate our approach on eight languages and demonstrate that it consistently outperforms models trained solely on the target language data. Furthermore, in most cases, our method surpasses the current state-of-the-art. Notably, our approach is highly data-efficient, retrieving as few as 200 instances in some cases while maintaining superior performance. Moreover, it is scalable, as the retrieval pool can be easily expanded, and the method can be readily adapted to new languages and tasks. We also apply maximum marginal relevance to mitigate redundancy and filter out highly similar retrieved instances, resulting in improvements in some languages.
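Maximum marginal relevance (MMR), mentioned at the end of the abstract above, greedily selects items that are relevant to the query but dissimilar to items already chosen. The following is a minimal sketch of the generic technique, assuming cosine similarity over dense vectors and a trade-off parameter `lam`; the similarity function, the value of `lam`, and the toy vectors are assumptions rather than the paper's configuration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mmr_select(query_vec, candidates, k, lam=0.5):
    """Return indices of up to k candidates chosen by greedy MMR.

    Score = lam * sim(query, cand) - (1 - lam) * max sim(cand, selected).
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy pool: items 0 and 1 are near-duplicates, item 2 is distinct.
pool = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
diverse = mmr_select([1.0, 1.0], pool, k=2, lam=0.5)
relevance_only = mmr_select([1.0, 1.0], pool, k=2, lam=1.0)
```

With `lam=1.0` the selection reduces to plain nearest-neighbor retrieval and keeps both near-duplicates; with `lam=0.5` the second pick switches to the distinct item.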
Han Deng, Yuan Meng, Shixiang Tang, Wanli Ouyang, Xinzhu Ma
Competitive programming benchmarks are widely used in scenarios such as
programming contests and large language model assessments. However, the growing
presence of duplicate or highly similar problems raises concerns not only about
competition fairness, but also about the validity of competitive programming as
a benchmark for model evaluation. In this paper, we propose a new problem --
similar question retrieval -- to address this issue. Due to the lack of both
data and models, solving this problem is challenging. To this end, we introduce
CPRet, a retrieval-oriented benchmark suite for competitive programming,
covering four retrieval tasks: two code-centric (i.e., Text-to-Code and
Code-to-Code) and two newly proposed problem-centric tasks (i.e.,
Problem-to-Duplicate and Simplified-to-Full), built from a combination of
automatically crawled problem-solution data and manually curated annotations.
Our contribution includes both high-quality training data and temporally
separated test sets for reliable evaluation. In addition, we develop two
task-specialized retrievers based on this dataset: CPRetriever-Code, trained
with a novel Group-InfoNCE loss for problem-code alignment, and
CPRetriever-Prob, fine-tuned for identifying problem-level similarity. Both
models achieve strong results and are open-sourced for local use. Finally, we
analyze LiveCodeBench and find that high-similarity problems inflate model pass
rates and reduce differentiation, underscoring the need for similarity-aware
evaluation in future benchmarks.
Code and data are available at: https://github.com/coldchair/CPRet
Authors' comments: main 9 pages
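The abstract above names a Group-InfoNCE loss but does not define it. For orientation only, the sketch below shows the standard single-positive InfoNCE loss on which such variants are typically built. It is not the authors' group formulation, and the temperature value and function name are assumptions.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.05):
    """Standard InfoNCE loss for one anchor.

    sim_pos:  similarity between the anchor and its positive.
    sim_negs: similarities between the anchor and each negative.
    Returns -log( exp(s+/t) / (exp(s+/t) + sum_j exp(s_j/t)) ).
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    return log_denominator - logits[0]

# A well-separated positive gives a lower loss than a confused one.
good = info_nce(0.9, [0.1, 0.2])
bad = info_nce(0.1, [0.9, 0.8])
```

The loss approaches zero when the positive similarity dominates all negatives and grows as negatives score higher than the positive.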
Guangyuan Ma, Yongliang Ma, Xuanrui Gou, Zhenpeng Su, Ming Zhou, Songlin Hu
Large language model (LLM)-based hybrid retrieval uses LLMs to encode queries and documents into low-dimensional dense or high-dimensional sparse vectors. It retrieves documents relevant to search queries based on vector similarities. Documents are pre-encoded offline, while queries arrive in real time, necessitating an efficient online query encoder. Although LLMs significantly enhance retrieval capabilities, serving deeply parameterized LLMs slows down query inference throughput and increases demands for online deployment resources. In this paper, we propose LightRetriever, a novel LLM-based hybrid retriever with extremely lightweight query encoders. Our method retains a full-sized LLM for document encoding, but reduces the workload of query encoding to no more than an embedding lookup. Compared to serving a full-sized LLM on an H800 GPU, our approach achieves over a 1000x speedup for query inference with GPU acceleration, and even a 20x speedup without GPU. Experiments on large-scale retrieval benchmarks demonstrate that our method generalizes well across diverse retrieval tasks, retaining an average of 95% of full-sized performance.
Zhuchun Li, Xiaoxue Zhao, Xiang Zhou
We study the associative-memory network of Kuramoto-type oscillators that
stores a set of memorized patterns (memories). In [Phys. Rev. Lett., 92 (2004),
108101], Nishikawa, Lai and Hoppensteadt showed that the capacity of this
system for pattern retrieval with small errors can be made as high as that of
the Hopfield network. Some stability analysis efforts focus on mutually
orthogonal memories; however, the theoretical results do not ensure error-free
retrieval in general situations. In this paper, we present a route for using
the model in pattern retrieval problems with small or large errors. We employ
the eigenspectrum analysis of Jacobians and potential analysis of the gradient
flow to derive the stability/instability of binary patterns. For two memories,
the eigenspectrum of Jacobian at each pattern can be specified, which enables
us to give the critical value of the parameter to distinguish the memories from
all other patterns in stability. This setting of two memories substantially
reduces the number of stable patterns and enlarges their basins, allowing us to
recover defective patterns. We extend this approach to general cases and
present a deterministic method for ensuring error-free retrieval across a
general set of standard patterns. Numerical simulations and comparative
analyses illustrate the approach.
Authors' comments: 23 pages, 5 figures, 1 table
Yisheng Zhong, Yizhu Wen, Junfeng Guo, Mehran Kafai, Heng Huang, Hanqing Guo, Zhuangdi Zhu
The protection of cyber Intellectual Property (IP) such as web content is an
increasingly critical concern. The rise of large language models (LLMs) with
online retrieval capabilities enables convenient access to information but
often undermines the rights of original content creators. As users increasingly
rely on LLM-generated responses, they engage less directly with original
information sources, which significantly reduces the incentives for IP creators
to contribute and leads to a cyberspace increasingly saturated with
AI-generated content. In response, we propose a novel defense framework that
empowers web content creators to safeguard their web-based IP from unauthorized
LLM real-time extraction and redistribution by leveraging the semantic
understanding capability of LLMs themselves. Our method follows principled
motivations and effectively addresses an intractable black-box optimization
problem. Real-world experiments demonstrated that our methods improve defense
success rates from 2.5% to 88.6% on different LLMs, outperforming traditional
defenses such as configuration-based restrictions.
Authors' comments: 13 pages, 13 figures, 4 tables