Huawei Lin, Jikai Long, Zhaozhuo Xu, Weijie Zhao
Given a Large Language Model (LLM) generation, how can we identify which
training data led to this generation? In this paper, we propose RapidIn, a
scalable framework adapted to LLMs for estimating the influence of each
training example. The proposed framework consists of two stages: caching and
retrieval. First, we compress the gradient vectors by over 200,000x, allowing
them to be cached on disk or in GPU/CPU memory. Then, given a generation,
RapidIn efficiently traverses the cached gradients to estimate the influence
within minutes, achieving over a 6,326x speedup. Moreover, RapidIn supports
multi-GPU parallelization to substantially accelerate caching and retrieval.
Our empirical results confirm the efficiency and effectiveness of RapidIn.
Authors' comments: Accepted to ACL 2024. Keywords: Influence Function, Influence
Estimation, Training Data Attribution
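The two-stage design described above (cache compressed gradients once, then score them against a generation's gradient) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes compression by a plain Gaussian random projection and scores influence by an inner product, and the function names `make_projection`, `cache_gradients`, and `retrieve_influence` are hypothetical.

```python
import numpy as np

def make_projection(dim, k, seed=0):
    """Random Gaussian projection from dim to k dimensions.

    Assumption: a stand-in compressor; the paper's compression scheme may differ.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((k, dim)) / np.sqrt(k)

def cache_gradients(train_grads, proj):
    """Caching stage: compress every per-example gradient once and keep the result."""
    return train_grads @ proj.T  # shape (n_train, k)

def retrieve_influence(test_grad, cache, proj):
    """Retrieval stage: compress the generation's gradient and score all cached entries."""
    return cache @ (proj @ test_grad)  # shape (n_train,)
```

Because random projections approximately preserve inner products, the scores computed in the compressed space approximate those computed on the full gradients, which is what makes the cache small enough to traverse quickly.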
Haowen Hou, Xiaopeng Yan, Yigeng Zhang, Fengzong Lian, Zhanhui Kang
In the field of cross-modal retrieval, single encoder models tend to perform
better than dual encoder models, but they suffer from high latency and low
throughput. In this paper, we present a dual encoder model called BagFormer
that utilizes a cross-modal interaction mechanism to improve recall performance
without sacrificing latency and throughput. BagFormer achieves this through the
use of bag-wise interactions, which allow for the transformation of text to a
more appropriate granularity and the incorporation of entity knowledge into the
model. Our experiments demonstrate that BagFormer is able to achieve results
comparable to state-of-the-art single encoder models in cross-modal retrieval
tasks, while also offering efficient training and inference with 20.72 times
lower latency and 25.74 times higher throughput.
Authors' comments: 8 pages, 4 figures, 4 tables
Haoyu Tang, Jihua Zhu, Meng Liu, Zan Gao, Zhiyong Cheng
Video moment retrieval aims to retrieve a moment in a video for a given
language query. The challenges of this task include 1) localizing the
relevant moment in an untrimmed video, and 2) bridging the
semantic gap between the textual query and the video content. To tackle these
problems, early approaches use sliding windows or uniform sampling to
collect video clips first and then match each clip with the query.
These strategies are time-consuming and often yield unsatisfactory
localization accuracy because the length of the ground-truth moment is
unpredictable. To avoid these limitations, researchers have recently attempted
to predict the relevant moment boundaries directly, without first generating
video clips. One
mainstream approach is to generate a multimodal feature vector for the target
query and video frames (e.g., concatenation) and then use a regression approach
upon the multimodal feature vector for boundary detection. Although this
approach has achieved some progress, we argue that such methods do not
adequately capture the cross-modal interactions between the query and video
frames.
In this paper, we propose an Attentive Cross-modal Relevance Matching (ACRM)
model, which predicts the temporal boundaries based on interaction modeling.
In addition, an attention module is introduced to assign higher weights to
query words with richer semantic cues, which are considered to be more
important for finding relevant video contents. Another contribution is that we
propose an additional predictor to utilize the internal frames in the model
training to improve the localization accuracy. Extensive experiments on two
datasets, TACoS and Charades-STA, demonstrate the superiority of our method over
several state-of-the-art methods. Ablation studies have also been conducted to
examine the effectiveness of different modules in our ACRM model.
Authors' comments: 12 pages; accepted by IEEE TMM
Zhipeng Zhang, Bing Li, Weiming Hu, Houwen Peng
The encoding of the target in object tracking has recently moved from coarse
bounding boxes to fine-grained segmentation maps. Revisiting de facto
real-time approaches that are capable of predicting masks during tracking, we
observe that they usually fork a light branch from the backbone network for
segmentation. Although efficient, directly fusing backbone features without
considering the negative influence of background clutter tends to introduce
false-negative predictions, degrading segmentation accuracy. To mitigate this
problem, we propose an attention retrieval network (ARN) to impose soft
spatial constraints on backbone features. We first build a look-up table (LUT)
with the ground-truth mask in the starting frame, and then query the LUT to
obtain an attention map for spatial constraints. Moreover, we introduce a
multi-resolution multi-stage segmentation network (MMS) to further weaken the
influence of background clutter by reusing the predicted mask to filter
backbone features. Our approach sets a new state of the art on the recent
pixel-wise object tracking benchmark VOT2020 while running at 40 fps. Notably, the
proposed model surpasses SiamMask by 11.7/4.2/5.5 points on VOT2020, DAVIS2016,
and DAVIS2017, respectively. We will release our code at
https://github.com/researchmm/TracKit.
Authors' comments: Some technical errors. We will provide new versions later
Yunhan Li, Mingjie Xie, Zihan Gong, Zeyang Shi, Gengshen Wu, Min Yang
Recent advances in embedding-based retrieval have enabled dense retrievers to serve as core infrastructure in many industrial systems, where a single retrieval backbone is often shared across multiple downstream applications. In such settings, retrieval quality directly constrains system performance and extensibility, while coupling model selection, deployment, and rollback decisions across applications.
In this paper, we present empirical findings and a system-level solution for optimizing retrieval components deployed as a shared backbone in production legal retrieval systems. We adopt a multi-stage optimization framework for dense retrievers and rerankers, and show that different retrieval components exhibit stage-dependent trade-offs. These observations motivate a component-wise, mixed-stage configuration rather than relying on a single uniformly optimal checkpoint. The resulting backbone is validated through end-to-end evaluation and deployed as a shared retrieval service supporting multiple industrial applications.
Authors' comments: 4 pages, 3 figures, 3 tables
Kyumin Lee, Minjin Jeon, Sanghwan Jang, Hwanjo Yu
Answering complex real-world questions requires step-by-step retrieval and
integration of relevant information to generate well-grounded responses.
However, existing knowledge distillation methods overlook the need for
different reasoning abilities at different steps, hindering transfer in
multi-step retrieval-augmented frameworks. To address this, we propose Stepwise
Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step
Retrieval-Augmented Language Models (StepER). StepER employs step-wise
supervision to align with evolving information and reasoning demands across
stages. Additionally, it incorporates difficulty-aware training to
progressively optimize learning by prioritizing suitable steps. Our method is
adaptable to various multi-step retrieval-augmented language models, including
those that use retrieval queries for reasoning paths or decomposed questions.
Extensive experiments show that StepER outperforms prior methods on multi-hop
QA benchmarks, with an 8B model achieving performance comparable to a 70B
teacher model.
Authors' comments: EMNLP 2025 Main
Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan, Shuo Wang, Zhiyuan Liu, Yu Gu et al.
Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge during generation. Existing MRAG methods typically adopt a static retrieval pipeline that fetches relevant information from multiple Knowledge Bases (KBs), followed by a refinement step. However, these approaches overlook the reasoning and planning capabilities of MLLMs to dynamically determine how to interact with different KBs during the reasoning process. To address this limitation, we propose R1-Router, a novel MRAG framework that learns to decide when and where to retrieve knowledge based on the evolving reasoning state. Specifically, R1-Router generates follow-up queries according to the current reasoning step, routes these intermediate queries to the most suitable KB, and integrates external knowledge into a coherent reasoning trajectory to answer the original query. Furthermore, we introduce Step-wise Group Relative Policy Optimization (Step-GRPO), a tailored reinforcement learning algorithm that assigns step-specific rewards to optimize the reasoning behavior of MLLMs. Experimental results on various open-domain QA benchmarks across multiple modalities demonstrate that R1-Router outperforms baseline models by over 7%. Further analysis shows that R1-Router can adaptively and effectively leverage diverse KBs, reducing unnecessary retrievals and improving both efficiency and accuracy.
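The idea of step-specific rewards combined with group-relative optimization can be illustrated with a small sketch. It assumes the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation) applied independently at each reasoning step; the exact reward design and normalization used by Step-GRPO are not given in the abstract, so the function `stepwise_group_advantages` is a hypothetical illustration only.

```python
import numpy as np

def stepwise_group_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of sampled trajectories, one step at a time.

    rewards: array of shape (group_size, n_steps), one reward per trajectory per step.
    Returns an array of the same shape holding group-relative advantages.
    Assumption: per-step mean/std normalization, as in standard GRPO.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0, keepdims=True)  # per-step group mean
    std = rewards.std(axis=0, keepdims=True)    # per-step group spread
    return (rewards - mean) / (std + eps)
```

A trajectory that earns a higher reward than its group at a given step receives a positive advantage for that step, even if its other steps are below average.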
Avideep Mukherjee, Soumya Banerjee, Piyush Rai, Vinay P. Namboodiri
Diffusion-based models demonstrate impressive generation capabilities. However, they also have a massive number of parameters, resulting in enormous model sizes, thus making them unsuitable for deployment on resource-constrained devices. Block-wise generation can be a promising alternative for designing compact-sized (parameter-efficient) deep generative models since the model can generate one block at a time instead of generating the whole image at once. However, block-wise generation is also considerably challenging because ensuring coherence across generated blocks can be non-trivial. To this end, we design a retrieval-augmented generation (RAG) approach and leverage the corresponding blocks of the images retrieved by the RAG module to condition the training and generation stages of a block-wise denoising diffusion model. Our conditioning schemes ensure coherence across the different blocks during training and, consequently, during generation. While we showcase our approach using the latent diffusion model (LDM) as the base model, it can be used with other variants of denoising diffusion models. Extensive experiments validate that the proposed approach resolves the coherence problem while achieving a compact model size and excellent generation quality.
Haichuan Hu, Yuhan Sun, Qunjun Zhang
Retrieval-Augmented Generation (RAG) has become a primary technique for mitigating hallucinations in large language models (LLMs). However, incomplete knowledge extraction and insufficient understanding can still mislead LLMs to produce irrelevant or even contradictory responses, which means hallucinations persist in RAG. In this paper, we propose LRP4RAG, a method based on the Layer-wise Relevance Propagation (LRP) algorithm for detecting hallucinations in RAG. Specifically, we first utilize LRP to compute the relevance between the input and output of the RAG generator. We then apply further extraction and resampling to the relevance matrix. The processed relevance data are input into multiple classifiers to determine whether the output contains hallucinations. To the best of our knowledge, this is the first time that LRP has been used for detecting RAG hallucinations, and extensive experiments demonstrate that LRP4RAG outperforms existing baselines.
Tuan-Luc Huynh, Thuy-Trang Vu, Weiqing Wang, Yinwei Wei, Trung Le, Dragan Gasevic, Yuan-Fang Li, Thanh-Toan Do
Differentiable Search Index (DSI) utilizes Pre-trained Language Models (PLMs)
for efficient document retrieval without relying on external indexes. However,
DSIs need full re-training to handle updates in dynamic corpora, causing
significant computational inefficiencies. We introduce PromptDSI, a
rehearsal-free, prompt-based approach for instance-wise incremental learning in
document retrieval. PromptDSI attaches prompts to the frozen PLM's encoder of
DSI, leveraging its powerful representation to efficiently index new corpora
while maintaining a balance between stability and plasticity. We eliminate the
initial forward pass of prompt-based continual learning methods that doubles
training and inference time. Moreover, we propose a topic-aware prompt pool
that employs neural topic embeddings as fixed keys. This strategy ensures
diverse and effective prompt usage, addressing the challenge of parameter
underutilization caused by the collapse of the query-key matching mechanism.
Our empirical evaluations demonstrate that PromptDSI matches IncDSI in managing
forgetting while significantly enhancing recall by over 4% on new corpora.
Authors' comments: 21 pages
Ruiwen Zhou, Yingxuan Yang, Muning Wen, Ying Wen, Wenhao Wang, Chunling Xi, Guoqiang Xu, Yong Yu et al.
Numerous large language model (LLM) agents have been built for different
tasks like web navigation and online shopping, owing to LLMs' broad knowledge
and text-understanding ability. Many of these works use in-context
examples to achieve generalization without the need for fine-tuning, but few
have considered the problem of how to select and effectively utilize
these examples. Recently, methods based on trajectory-level retrieval with task
meta-data and using trajectories as in-context examples have been proposed to
improve the agent's overall performance in some sequential decision making
tasks. However, these methods can be problematic because the retrieved examples
may be merely plausible, lacking task-specific state transition dynamics, and
because the input becomes long and filled with irrelevant context. In this
paper, we propose a novel framework
(TRAD) to address these issues. TRAD first conducts Thought Retrieval,
achieving step-level demonstration selection via thought matching, leading to
more helpful demonstrations and less irrelevant input noise. Then, TRAD
introduces Aligned Decision, complementing retrieved demonstration steps with
their previous or subsequent steps, which enables tolerance for imperfect
thought and provides a choice for balance between more context and less noise.
Extensive experiments on ALFWorld and Mind2Web benchmarks show that TRAD not
only outperforms state-of-the-art models but also effectively helps in reducing
noise and promoting generalization. Furthermore, TRAD has been deployed in
real-world scenarios of a global business insurance company and improves the
success rate of robotic process automation.
Authors' comments: Codes available at: https://github.com/skyriver-2000/TRAD-Official
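The two steps described above, step-level retrieval by thought matching followed by expansion with neighboring steps, can be sketched roughly as follows. The sketch assumes that thoughts are already embedded as vectors, that matching uses cosine similarity, and that all stored steps belong to one trajectory; the names `thought_retrieval` and `aligned_decision` are illustrative and are not taken from the released code.

```python
import numpy as np

def thought_retrieval(query_vec, step_vecs, top_k=1):
    """Return indices of stored steps whose thought embeddings best match the query.

    Assumption: cosine similarity over precomputed thought embeddings.
    """
    q = np.asarray(query_vec, dtype=float)
    s = np.asarray(step_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    sims = s @ q
    return [int(i) for i in np.argsort(-sims)[:top_k]]

def aligned_decision(step_index, trajectory_len, before=1, after=1):
    """Expand a retrieved step with its previous and subsequent steps.

    Larger windows give more context; smaller windows give less noise.
    """
    lo = max(0, step_index - before)
    hi = min(trajectory_len, step_index + after + 1)
    return list(range(lo, hi))
```

The `before` and `after` parameters expose the trade-off the abstract mentions: widening the window tolerates an imperfect thought match at the cost of a longer prompt.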
Chunyu Li, Jiajia Ding, Xing hu, Fan Wang
Long document retrieval (DR) has long been a major challenge for reading comprehension and information retrieval. In recent years, pre-trained models have achieved good results in both the retrieval and ranking stages for long documents. However, several crucial problems remain in long document ranking, such as label noise, long document representation, and unbalanced negative sampling. To reduce the effect of label noise and to sample negatives sensibly from the retrieved long documents, we propose a bag sampling method and a group-wise Localized Contrastive Estimation (LCE) method. We encode each long document using its head, middle, and tail passages, and in the retrieval stage we use dense retrieval to generate candidates. At the ranking stage, the retrieved candidates are divided into multiple bags, and negative samples are selected from each bag. After sampling, two losses are combined. The first loss is LCE. To fit bag sampling, the second is a group-wise LCE loss: after the query and document are encoded, the global features of each group are extracted by a convolutional layer and max-pooling, which improves the model's robustness to label noise, and the group-wise LCE loss is then computed. Notably, our model shows excellent performance on the MS MARCO Long document ranking leaderboard.
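The bag sampling step and the LCE loss can be sketched as below. The sketch assumes that bags are contiguous slices of the ranked candidate list, that one negative is drawn uniformly from each bag, and that LCE takes its usual form of a softmax cross-entropy over one positive and its sampled negatives; the group-wise variant with convolution and max-pooling is not reproduced. The function names are hypothetical.

```python
import numpy as np

def bag_sample_negatives(candidate_ids, n_bags, rng):
    """Split ranked candidate ids into contiguous bags and draw one negative per bag.

    Assumption: uniform sampling within each bag.
    """
    bags = np.array_split(np.asarray(candidate_ids), n_bags)
    return [int(rng.choice(bag)) for bag in bags if len(bag) > 0]

def lce_loss(scores, positive_index=0):
    """Localized contrastive estimation for one query.

    scores: ranker scores for one positive document and its sampled negatives.
    Returns -log softmax(scores)[positive_index], computed stably.
    """
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[positive_index])
```

Drawing negatives from every bag spreads them across the whole ranked list instead of concentrating them at the top, where unlabeled relevant documents are most likely to appear.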
Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu
Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges. These challenges encompass a wide range of issues, from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method, classifying user queries into four levels based on the type of external data required and primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and most effective techniques for addressing these challenges. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications.
Harshil Kothari, Michael C. Cushing, Ben Burningham, Samuel A. Beiler, J. Davy Kirkpatrick, Adam C. Schneider, Sagnick Mukherjee, Mark S. Marley
We present an atmospheric retrieval analysis of the Y0 brown dwarf WISE J035934.06$-$540154.6 using the low-resolution 0.96--12 $\mu$m JWST spectrum presented in \citet{Beiler_2023}. We obtain volume number mixing ratios of the major gas-phase absorbers (H$_2$O, CH$_4$, CO, CO$_2$, PH$_3$, and H$_2$S) that are 3--5$\times$ more precise than previous work that used HST spectra. We also find an order-of-magnitude improvement in the precision of the retrieved thermal profile, a direct result of the broad wavelength coverage of the JWST data. We used the retrieved thermal profile and surface gravity to generate a grid of chemical forward models with varying metallicity, (C/O)$_\textrm{atm}$, and strengths of vertical mixing as encapsulated by the eddy diffusion coefficient $K_\textrm{zz}$. Comparison of the retrieved abundances with this grid of models suggests that the deep atmosphere of WISE 0359$-$54 shows signs of vigorous vertical mixing with $K_\textrm{zz}=10^9$ [cm$^{2}$ s$^{-1}$]. To test the sensitivity of these results to our 5-knot spline thermal profile model, we performed a second retrieval using the \citet{Madhusudhan_2009} thermal profile model. While the results of the two retrievals generally agree well, we do find differences between the retrieved values of mass and volume number mixing ratio of H$_2$S, with fractional differences of the median values of $-$0.64 and $-$0.10, respectively. In addition, the 5-knot thermal profile is consistently warmer at pressures between 1 and 70 bar. Nevertheless, our results underscore the power that the broad-wavelength infrared spectra obtainable with the James Webb Space Telescope have to characterize the atmospheres of cool brown dwarfs.
Qi-yue Yu, Shi-wen Lin, Ting-wei Yang
A finite-field multiple-access (FFMA) system separates users within a finite
field by utilizing different element-pairs (EPs) as virtual resources. The
Cartesian product of distinct EPs forms an EP code, which serves as the input
to a finite-field multiplexing module (FF-MUX). This allows the FFMA technique
to reorder the channel coding and multiplexing modules, enabling the
superimposed signals to function as codewords that can be decoded by a channel
code. This flexibility allows the FFMA system to efficiently support a large
number of users with short packet traffic, addressing the finite blocklength
(FBL) challenge in multiuser reliable transmission. Designing EP codes is a
central challenge in FFMA systems. In this paper, we construct EP codes based
on a bit(s)-to-codeword transformation approach and define the corresponding EP
code as a codeword-wise EP (CWEP) code. We then investigate the encoding
process of EP codes, and propose unique sum-pattern mapping (USPM) structural
property constraints to design uniquely decodable CWEP codes. Next, we present
the $\kappa$-fold ternary orthogonal matrix ${\bf T}_{\rm o}(2^{\kappa},
2^{\kappa})$ over GF$(3^m)$, where $m = 2^{\kappa}$, and the ternary
non-orthogonal matrix ${\bf T}_{\rm no}(M,m)$ over GF$(3^m)$, for constructing
specific CWEP codes. Based on the proposed CWEP codes, we introduce three FFMA
modes: channel codeword multiple access (FF-CCMA), code division multiple
access (FF-CDMA), and non-orthogonal multiple access (FF-NOMA). Simulation
results demonstrate that all three modes effectively support massive user
transmissions with well-behaved error performance.
Authors' comments: 50 pages, 9 figures
Belhal Karimi, Ping Li, Xiaoyun Li
In the emerging paradigm of Federated Learning (FL), a large number of clients such as mobile devices are used to train possibly high-dimensional models on their respective data. Combining (dimension-wise) adaptive gradient methods (e.g. Adam, AMSGrad) with FL has been an active direction and has been shown to outperform traditional SGD-based FL in many cases. In this paper, we focus on the problem of training federated deep neural networks, and propose a novel FL framework which further introduces layer-wise adaptivity to the local model updates. Our framework can be applied to locally adaptive FL methods including two recent algorithms, Mime and Fed-AMS. Theoretically, we provide a convergence analysis of our layer-wise FL methods, coined Fed-LAMB and Mime-LAMB, which matches the convergence rate of state-of-the-art results in FL and exhibits linear speedup in terms of the number of workers. Experimental results on various datasets and models, under both IID and non-IID local data settings, show that both Fed-LAMB and Mime-LAMB achieve faster convergence and better generalization performance than several recent adaptive FL methods.
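Layer-wise adaptivity in the LAMB style can be illustrated with a short sketch of one local update. It assumes that the adaptive direction for each layer (for example an Adam or AMSGrad step) has already been computed and shows only the layer-wise rescaling by the ratio of weight norm to update norm; the function name `layerwise_scaled_step` and the fallback ratio of 1.0 for zero norms are assumptions, not details from the paper.

```python
import numpy as np

def layerwise_scaled_step(weights, updates, lr=0.1):
    """Apply one local step with a per-layer trust ratio.

    weights, updates: dicts mapping layer name to an array of the same shape.
    Each layer's update is rescaled by ||w|| / ||u||, so that every layer moves
    by the same fraction (lr) of its own weight norm.
    """
    new_weights = {}
    for name, w in weights.items():
        u = updates[name]
        w_norm = np.linalg.norm(w)
        u_norm = np.linalg.norm(u)
        # Assumed fallback: use ratio 1.0 when either norm is zero.
        trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
        new_weights[name] = w - lr * trust * u
    return new_weights
```

In a federated setting each client would run steps like this locally before the server aggregates the resulting models.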
D. J. Pinfield, J. Gomes, A. C. Day-Jones, S. K. Leggett, M. Gromadzki, B. Burningham, M. T. Ruiz, R. Kurtev et al.
A method is defined for identifying late T and Y dwarfs in WISE down to low
values of signal-to-noise. This requires a WISE detection only in the W2-band
and uses the statistical properties of the WISE multi-frame measurements and
profile fit photometry to reject contamination resulting from non-point-like
objects, variables and moving sources. To trace our desired parameter space we
use a control sample of isolated non-moving non-variable point sources from the
SDSS, and identify a sample of 158 WISE W2-only candidates down to a
signal-to-noise limit of 8. For signal-to-noise ranges >10 and 8-10
respectively, ~45% and ~90% of our sample fall outside the selection criteria
published by the WISE team (Kirkpatrick et al. 2012), due mainly to the type of
constraints placed on the number of individual W2 detections. We present
follow-up of eight candidates and identify WISE 0013+0634 and WISE 0833+0052,
T8 and T9 dwarfs with high proper motion (~1.3 and ~1.8 arcsec/yr). Both
objects show a mid-infrared/near-infrared excess of ~1-1.5 magnitudes, and are
K-band suppressed. Distance estimates lead to space motion constraints that
suggest halo (or at least thick disk) kinematics. We then assess the reduced
proper motion diagram of WISE ultracool dwarfs, which suggests that late T and
Y dwarfs may have a higher thick-disk/halo population fraction than earlier
objects.
Authors' comments: Accepted for publication in MNRAS
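For readers unfamiliar with the reduced proper motion diagram mentioned above, the quantity it plots can be computed as in the sketch below. It uses the standard definition H = m + 5 log10(mu) + 5, with mu in arcsec/yr, and the standard conversion v_tan = 4.74 mu d between proper motion, distance in parsecs, and tangential velocity in km/s. Any magnitudes or distances passed to these functions are placeholders, not measurements from the paper.

```python
import math

def reduced_proper_motion(apparent_mag, mu_arcsec_per_yr):
    """Reduced proper motion H = m + 5 log10(mu) + 5, with mu in arcsec/yr."""
    return apparent_mag + 5.0 * math.log10(mu_arcsec_per_yr) + 5.0

def tangential_velocity_kms(mu_arcsec_per_yr, distance_pc):
    """Tangential velocity in km/s from proper motion (arcsec/yr) and distance (pc)."""
    return 4.74 * mu_arcsec_per_yr * distance_pc
```

At fixed apparent magnitude, a larger proper motion gives a larger H, which is why high-velocity (thick-disk or halo) objects separate from the thin-disk population in this diagram.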
Musa Cim, Burak Topcu, Mahmut Taylan Kandemir
Quantization addresses the high resource demand of large language models (LLMs) by alleviating memory pressure and bandwidth congestion and providing significantly scaled compute power with a tolerable impact on accuracy. Four-bit floating point (FP4), the lowest-precision format that preserves essential numerical properties such as exponent and sign, has begun to be adopted in cutting-edge architectures, including Blackwell and AMD CDNA, to support LLM quantization and reduce deployment costs. Although aggressive quantization can yield efficiency gains, the quantization sensitivity of layers within the transformer, and whether these sensitivities generalize across existing FP4 formats and model scales, remain underexplored. To elucidate quantization sensitivity, this study conducts a systematic analysis of two FP4 formats, MXFP4 and NVFP4, across three Qwen2.5 model scales (0.5B, 7B, and 14B), using controlled component-wise and block-wise isolation methodologies. We observe that MLP up- and down-projection layers consistently dominate in terms of sensitivity, while gate and attention projections are moderately and substantially less sensitive to FP4 quantization, respectively. We further find that sensitivity does not universally localize to the final blocks; early blocks can also be highly sensitive, particularly under MXFP4. Our results provide a diagnostic characterization of the inference behavior of FP4 across components, depths, and FP4 formats.
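To make concrete what FP4 quantization does to a weight tensor, the sketch below simulates an MXFP4-style round trip. It assumes the microscaling layout of 32-element blocks that share one power-of-two scale, with each element stored as E2M1 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6). Ties are resolved by picking the lower grid value rather than round-to-nearest-even, and NVFP4, which differs in block size and scale format, is not modeled. The function names are hypothetical.

```python
import numpy as np

# Non-negative magnitudes representable by a 4-bit E2M1 element.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block):
    """Quantize and dequantize one block with a shared power-of-two scale."""
    block = np.asarray(block, dtype=float)
    amax = np.abs(block).max()
    if amax == 0:
        return np.zeros_like(block)
    # Shared scale: 2^(floor(log2(amax)) - 2), since the largest E2M1 exponent is 2.
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = np.clip(np.abs(block) / scale, 0.0, E2M1_GRID[-1])
    # Nearest grid point for every element (simplified tie handling).
    idx = np.abs(scaled[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(block) * E2M1_GRID[idx] * scale

def quantize_fp4(x, block_size=32):
    """Apply the block quantizer to a flat tensor whose size is a multiple of block_size."""
    x = np.asarray(x, dtype=float)
    if x.size % block_size != 0:
        raise ValueError("tensor size must be a multiple of block_size")
    blocks = x.reshape(-1, block_size)
    return np.stack([quantize_block_fp4(b) for b in blocks]).reshape(x.shape)
```

A component-wise sensitivity study of the kind described above would apply such a round trip to one group of layers at a time and measure the change in model quality.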
Yinghui Zhang, Tailin Chen, Yuchen Zhang, Zeyu Fu
The rapid rise of video content on platforms such as TikTok and YouTube has
transformed information dissemination, but it has also facilitated the spread
of harmful content, particularly hate videos. Despite significant efforts to
combat hate speech, detecting these videos remains challenging due to their
often implicit nature. Current detection methods primarily rely on unimodal
approaches, which inadequately capture the complementary features across
different modalities. While multimodal techniques offer a broader perspective,
many fail to effectively integrate temporal dynamics and modality-wise
interactions essential for identifying nuanced hate content. In this paper, we
present CMFusion, an enhanced multimodal hate video detection model utilizing a
novel Channel-wise and Modality-wise Fusion Mechanism. CMFusion first extracts
features from text, audio, and video modalities using pre-trained models and
then incorporates a temporal cross-attention mechanism to capture dependencies
between video and audio streams. The learned features are then processed by
channel-wise and modality-wise fusion modules to obtain informative
representations of videos. Our extensive experiments on a real-world dataset
demonstrate that CMFusion significantly outperforms five widely used baselines
in terms of accuracy, precision, recall, and F1 score. Comprehensive ablation
studies and parameter analyses further validate our design choices,
highlighting the model's effectiveness in detecting hate videos. The source
code will be made publicly available at https://github.com/EvelynZ10/cmfusion.
Authors' comments: ICDMW 2024, Github: https://github.com/EvelynZ10/cmfusion
Patrick Blumenberg, Thomas Graave, Tim Fingscheidt
Large language models (LLMs) demand extensive memory capacity during both fine-tuning and inference. To enable memory-efficient fine-tuning, existing methods apply block-wise quantization techniques, such as NF4 and AF4, to the network weights. We show that these quantization techniques incur suboptimal quantization errors. Therefore, as a first novelty, we propose an optimization approach for block-wise quantization. Using this method, we design a family of quantizers named 4-bit block-wise optimal float (BOF4), which consistently reduces the quantization error compared to both baseline methods. We provide both a theoretical and a data-driven solution for the optimization process and prove their practical equivalence. Secondly, we propose a modification to the employed normalization method based on the signed absolute block maximum (BOF4-S), enabling further reduction of the quantization error and empirically achieving less degradation in language modeling performance. Thirdly, we explore additional variations of block-wise quantization methods applied to LLMs through an experimental study on the importance of accurately representing zero and large-amplitude weights on the one hand, and optimization towards various error metrics on the other hand. Lastly, we introduce a mixed-precision quantization strategy dubbed outlier-preserving quantization (OPQ) to address the distributional mismatch induced by outlier weights in block-wise quantization. By storing outlier weights in 16-bit precision (OPQ) while applying BOF4-S, we achieve top performance among 4-bit block-wise quantization techniques w.r.t. perplexity.
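The mechanics of block-wise quantization with absolute-maximum normalization, the signed variant, and outlier preservation can be sketched as follows. The sketch uses an arbitrary 16-level codebook supplied by the caller (not the NF4, AF4, or BOF4 levels, which are not given in the abstract) and flags outliers with a simple n-sigma rule; both choices, and all function names, are assumptions for illustration.

```python
import numpy as np

def quantize_blockwise(w, codebook, block_size=64, signed=False):
    """Normalize each block by its largest-magnitude weight, then snap to the codebook.

    With signed=True the block is divided by the signed value of that weight,
    so the largest-magnitude weight always maps to +1.
    """
    blocks = np.asarray(w, dtype=float).reshape(-1, block_size)
    idx_max = np.abs(blocks).argmax(axis=1)
    scales = blocks[np.arange(len(blocks)), idx_max]
    if not signed:
        scales = np.abs(scales)
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    normed = blocks / scales[:, None]
    codes = np.abs(normed[..., None] - codebook).argmin(axis=-1)
    return codes, scales

def dequantize_blockwise(codes, scales, codebook):
    """Map codes back to real values and undo the per-block normalization."""
    return codebook[codes] * scales[:, None]

def outlier_preserving_roundtrip(w, codebook, block_size=64, n_sigma=3.0):
    """Keep outliers in full precision and quantize the rest (hypothetical n-sigma rule)."""
    w = np.asarray(w, dtype=float)
    outlier = np.abs(w - w.mean()) > n_sigma * w.std()
    inliers = np.where(outlier, 0.0, w)
    codes, scales = quantize_blockwise(inliers, codebook, block_size, signed=True)
    rec = dequantize_blockwise(codes, scales, codebook).reshape(w.shape)
    rec[outlier] = w[outlier]
    return rec
```

Removing outliers before normalization keeps a single large weight from stretching the block's scale, which would otherwise push most weights onto a few codebook levels near zero.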