Minghan Li, Eric Gaussier
Recent studies have demonstrated that the ability of dense retrieval models
to generalize to target domains with different distributions is limited, which
contrasts with the results obtained with interaction-based models. Prior
attempts to mitigate this challenge involved leveraging adversarial learning
and query generation approaches, but both approaches nevertheless resulted in
limited improvements. In this paper, we propose to combine the query-generation
approach with a self-supervision approach in which pseudo-relevance labels are
automatically generated on the target domain. To accomplish this, a T5-3B model
is used for pseudo-positive labeling, and hard negatives are carefully
chosen. We also apply this strategy to a conversational dense retrieval model for
conversational search. A similar pseudo-labeling approach is used, but with the
addition of a query-rewriting module to rewrite conversational queries for
subsequent labeling. This proposed approach enables a model's domain adaptation
with real queries and documents from the target dataset. Experiments on
standard dense retrieval and conversational dense retrieval models both
demonstrate improvements over baseline models when they are fine-tuned on the
pseudo-relevance labeled data.
Authors' comments: 12 pages, accepted by COLING 2024
Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
In this report, we present our approach for EPIC-KITCHENS-100 Multi-Instance
Retrieval Challenge 2022. We first parse sentences into semantic roles
corresponding to verbs and nouns; then utilize self-attentions to exploit
semantic role contextualized video features along with textual features via
triplet losses in multiple embedding spaces. Our method surpasses the strong
baseline in normalized Discounted Cumulative Gain (nDCG), which is more
valuable for semantic similarity. Our submission is ranked 3rd for nDCG and
ranked 4th for mAP.
Authors' comments: Ranked joint 3rd place in the Multi-Instance Retrieval Challenge at
EPIC@CVPR2022. (v2: ref error is corrected)
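For readers unfamiliar with the metric, nDCG can be sketched as follows. This is the common textbook definition with a logarithmic rank discount, not the challenge's official evaluation code; the function names and the graded relevance values are illustrative assumptions.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of graded relevances given in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance scores, listed in the order a system ranked the items.
ranked = [1.0, 0.0, 0.5]
score = ndcg(ranked)
```

Because the gain is graded rather than binary, a partially relevant item ranked high still earns credit, which is why the metric suits semantic similarity.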
Debadutta Patra, Ayush Bardhan Tripathy, Soumya Ranjan Sahu, Sucheta Panda
Digital twin technology, when combined with physics-informed machine learning and Aspen simulation results, offers transformative capabilities for industrial process monitoring, control, and optimization. This work presents a Physics-Informed Neural Network (PINN) digital twin framework for the dynamic, tray-wise modeling of binary distillation columns operating under transient conditions. The architecture embeds fundamental thermodynamic constraints, including vapor-liquid equilibrium (VLE) described by modified Raoult's law, tray-level mass and energy balances, and the McCabe-Thiele graphical methodology, directly into the neural network loss function via physics residual terms. The model is trained and evaluated on a high-fidelity synthetic dataset of 961 timestamped measurements spanning 8 hours of transient operation, generated in Aspen HYSYS for a binary HX/TX distillation system comprising 16 sensor streams. An adaptive loss-weighting scheme balances the data fidelity and physics consistency objectives during training. Compared to five data-driven baselines (LSTM, vanilla MLP, GRU, Transformer, DeepONet), the proposed PINN achieves an RMSE of 0.00143 for HX mole fraction prediction (R^2 = 0.9887), representing a 44.6% reduction over the best data-only baseline, while strictly satisfying thermodynamic constraints. Tray-wise temperature and composition profiles predicted under transient perturbations demonstrate that the digital twin accurately captures column dynamics, including feed tray responses, reflux ratio variations, and pressure transients. These results establish the proposed PINN digital twin as a robust foundation for real-time soft sensing, model-predictive control, and anomaly detection in industrial distillation processes.
Authors' comments: 17 pages, 10 figures
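A physics residual term of the kind described above might look like the following minimal sketch. It assumes modified Raoult's law in the form y·P = x·γ·P_sat and a fixed weight between data and physics losses; the paper's adaptive weighting scheme, network, and actual property values are not reproduced, and every name and number here is hypothetical.

```python
def raoult_residual(x, y, gamma, p_sat, pressure):
    """Residual of modified Raoult's law for one component: y*P - x*gamma*P_sat."""
    return y * pressure - x * gamma * p_sat

def physics_loss(samples):
    """Mean squared VLE residual over (x, y, gamma, p_sat, pressure) tuples."""
    residuals = [raoult_residual(*s) for s in samples]
    return sum(r * r for r in residuals) / len(residuals)

def total_loss(data_loss, phys_loss, weight):
    """Data-fit loss plus a weighted physics penalty (fixed weight assumed here)."""
    return data_loss + weight * phys_loss

# Hypothetical tray states; pressures in kPa.
consistent = (0.3, 0.6, 1.0, 200.0, 100.0)    # satisfies y*P = x*gamma*P_sat
inconsistent = (0.3, 0.7, 1.0, 200.0, 100.0)  # violates it by 10 kPa
penalty = physics_loss([consistent, inconsistent])
loss = total_loss(data_loss=0.2, phys_loss=penalty, weight=0.1)
```

A prediction that fits the data but violates equilibrium is penalized through the second term, which is how the constraint enters training.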
Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu, Qingyun Sun, Chen Gao, Zhibo Chen et al.
Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to $1.93\times$ speedup while maintaining a PSNR of up to 29 dB on Wan2.1.
Hengyuan Zhang, Xinrong Chen, Zunhai Su, Xiao Liang, Jing Xiong, Wendong Xu, He Xiao, Chaofan Tao et al.
Layer-wise mixed-precision quantization (LMPQ) enables effective compression under extreme low-bit settings by allocating higher precision to sensitive layers. However, existing methods typically treat all intra-layer weight modules uniformly and rely on a single numerical property when estimating sensitivity, overlooking their distinct operational roles and structural characteristics. To address this, we propose NSDS, a novel calibration-free LMPQ framework driven by Numerical and Structural Dual-Sensitivity. Specifically, it first mechanistically decomposes each layer into distinct operational roles and quantifies their sensitivity from both numerical and structural perspectives. These dual-aspect scores are then aggregated into a unified layer-wise metric through a robust aggregation scheme based on MAD-Sigmoid and Soft-OR to guide bit allocation. Extensive experiments demonstrate that NSDS consistently achieves superior performance compared to various baselines across diverse models and downstream tasks, without relying on any calibration data.
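The abstract does not define MAD-Sigmoid or Soft-OR. One plausible reading, shown here purely as an assumption and not as the NSDS formulation, is a robust z-score built from the median absolute deviation (MAD), squashed by a sigmoid, followed by a probabilistic OR of the two normalized scores. All function names and input values below are hypothetical.

```python
import math
from statistics import median

def mad_sigmoid(scores, eps=1e-8):
    """Map raw scores to (0, 1) via a MAD-based robust z-score and a sigmoid."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    out = []
    for s in scores:
        z = (s - med) / (mad + eps)
        z = max(-50.0, min(50.0, z))  # clamp so math.exp cannot overflow
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out

def soft_or(a, b):
    """Probabilistic OR: large when either input is large."""
    return 1.0 - (1.0 - a) * (1.0 - b)

# Hypothetical per-layer raw sensitivities from two perspectives.
numerical = [0.1, 0.2, 0.3, 0.4, 5.0]
structural = [2.0, 1.0, 3.0, 1.5, 2.5]

combined = [soft_or(n, s)
            for n, s in zip(mad_sigmoid(numerical), mad_sigmoid(structural))]
# Layers ordered from most to least sensitive; higher precision would go first.
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```

Under this reading, the MAD normalization keeps a single outlier layer from compressing the remaining scores, and the OR flags a layer as sensitive if either perspective does.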
Dale Julson, Eric Reinhardt, Andrii Krutsylo, Resham Sohal, Guillermo Fidalgo, Sergei Gleyzer, Emanuele Usai, The CMS HCAL Collaboration
Machine learning (ML) techniques have been demonstrated to improve the accuracy and efficiency of anomaly detection (AD) when compared to conventional methods. This has led to the adoption of ML for data quality monitoring (DQM) use cases in order to monitor the operation of certain systems to ensure that they are free of undesirable or potentially deleterious anomalies. For applications in the field of high-energy physics (HEP), where detectors must operate in long-running, harsh environments, ML models used in DQM that have been trained on static datasets are bound to experience degraded performance due to distributional shifts that naturally occur in the incoming data streams, unless directly mitigated via the inclusion of continual ML techniques. This work introduces DepthViT, a lightweight masked autoencoder architecture that employs unique depth-wise embeddings and cross-depth attention to perform computationally efficient AD tasks. A continual learning framework is developed in which DepthViT models trained on the most recent data streams are ensembled with older models to create a robust overall system that is more resilient to shifts in incoming data streams. When evaluated on occupancy maps from the Compact Muon Solenoid (CMS) hadron calorimeter across multiple data-taking campaigns, the proposed method maintains precision above 99\% and a stable ratio of correct anomaly predictions to number of anomalies under both small and large distributional shifts. Beyond HEP, the same ensembling-based continual adaptation strategy can be directly applied to industrial monitoring environments where data also naturally evolve over time. This work therefore presents a path toward adaptive anomaly detection systems capable of sustained operation in dynamic data environments.
Gökdeniz Gülmez
Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency. However, standard MoE implementations rely on two rigid design assumptions: (1) fixed Top-K routing where exactly K experts are activated per token, and (2) uniform expert allocation across all layers. This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation. DynaMoE introduces a principled routing mechanism where the number of active experts per token varies based on input complexity. Concurrently, the framework implements six distinct scheduling strategies for distributing expert capacity across network depth, including descending, ascending, pyramid, and wave patterns. We theoretically analyze the expressivity gains of dynamic routing and derive bounds on computational efficiency. Through extensive experiments on MNIST, Fashion-MNIST, CIFAR-10 (image classification), and Recycling-the-Web (language modeling) across multiple model scales, we demonstrate that DynaMoE achieves superior parameter efficiency compared to static baselines. Our key finding is that optimal expert schedules are task- and scale-dependent: descending schedules (concentrating capacity in early layers) outperform uniform baselines on image classification. For language modeling, optimal schedules vary by model size, descending for Tiny, ascending for Small, and uniform for Medium. Furthermore, dynamic routing reduces gradient variance during training, leading to improved convergence stability. DynaMoE establishes a new framework for adaptive computation in neural networks, providing principled guidance for MoE architecture design.
Junuk Cha, Jihyeon Kim, Han-Mu Park
Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that constrains the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness of the proposed recognizer and generator. Code and data are available at https://github.com/JunukCha/OpenFS.
Authors' comments: Accepted to CVPR 2026
Jingyu Xiao, Jiantong Qin, Shuoqi Li, Man Ho Lam, Yuxuan Wan, Jen-tse Huang, Yintong Huo, Michael R. Lyu
Multimodal Large Language Models (MLLMs) have demonstrated strong performance on the UI-to-code task, which aims to generate UI code from design mock-ups. However, when applied to long and complex websites, they often struggle with fragmented segmentation, redundant code generation for repetitive components, and frequent UI inconsistencies. To systematically investigate and address these challenges, we introduce ComUIBench, a new multi-page complex webpage benchmark with component annotations, designed to evaluate MLLMs' ability to generate reusable UI code in realistic website scenarios. Building upon this benchmark, we propose ComUICoder, a component-based UI code generation framework that emphasizes semantic-aware segmentation, code reuse, and fine-grained refinement. Specifically, ComUICoder incorporates (1) Hybrid Semantic-aware Block Segmentation for accurate detection of semantically coherent UI blocks, (2) Visual-aware Graph-based Block Merge to consolidate structurally similar components within and across webpages for reusable implementation, and (3) Priority-based Element-wise Feedback to refine generated code and reduce element-level inconsistencies. Extensive experiments demonstrate that ComUICoder significantly improves overall generation quality and code reusability on complex multi-page websites. Our datasets and code are publicly available at https://github.com/WebPAI/ComUICoder.
Jihun Kim, Javad Lavaei
The least-squares estimator has achieved considerable success in learning linear dynamical systems from a single trajectory of length $T$. While it attains an optimal error of $\mathcal{O}(1/\sqrt{T})$ under independent zero-mean noise, it lacks robustness and is particularly susceptible to adversarial corruption. In this paper, we consider the identification of a networked system in which every node is subject to both noise and adversarial attacks. We assume that every node is independently corrupted with probability smaller than $0.5$ at each time, placing the overall system under almost-persistent local attack. We first show that no convex one-stage estimator can achieve a consistent estimate as $T$ grows under both noise and attacks. This motivates the development of a two-stage estimation method applied across nodes. In Stage I, we leverage the $\ell_1$-norm estimator and derive an estimation error bound proportional to the noise level $σ_w$. This bound is subsequently used to detect and filter out attacks, producing a clean dataset for each node, to which we apply the least-squares estimator in Stage II. The resulting estimation error is on the order $\mathcal{O}(1/\sqrt{T})$ plus the product of $σ_w$ and the number of misclassifications. In the event of perfect separability between attack and non-attack data, which occurs when injected attacks are sufficiently large relative to the noise scale, our two-stage estimator is consistent for the true system.
Authors' comments: 33 pages
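The two-stage idea can be illustrated on a toy scalar system. This is a sketch under strong simplifying assumptions, not the paper's networked estimator: a single state x[t+1] = a·x[t] + w[t] + attack[t], symmetric attacks of fixed size, and a hand-picked residual threshold in place of the derived Stage I error bound. In the scalar case the ℓ1 estimator reduces to a weighted median of the ratios x[t+1]/x[t] with weights |x[t]|. All names and parameter values are hypothetical.

```python
import random

def simulate(a, T, sigma_w, attack_prob, attack_size, seed=0):
    """Simulate x[t+1] = a*x[t] + w[t] + attack[t] for a single scalar state."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(T):
        w = rng.gauss(0.0, sigma_w)
        attacked = rng.random() < attack_prob
        attack = rng.choice([-1.0, 1.0]) * attack_size if attacked else 0.0
        xs.append(a * xs[-1] + w + attack)
    return xs

def l1_estimate(xs):
    """Stage I: argmin_a sum |x[t+1] - a*x[t]|, a weighted median of ratios."""
    pairs = sorted((xn / xc, abs(xc)) for xc, xn in zip(xs, xs[1:]) if xc != 0.0)
    half = sum(w for _, w in pairs) / 2.0
    acc = 0.0
    for ratio, w in pairs:
        acc += w
        if acc >= half:
            return ratio
    raise ValueError("trajectory has no nonzero states")

def filter_and_least_squares(xs, a_l1, threshold):
    """Stage II: keep transitions with small Stage I residuals, then least squares."""
    clean = [(xc, xn) for xc, xn in zip(xs, xs[1:])
             if abs(xn - a_l1 * xc) <= threshold]
    num = sum(xc * xn for xc, xn in clean)
    den = sum(xc * xc for xc, _ in clean)
    return num / den, len(clean)

xs = simulate(a=0.8, T=2000, sigma_w=0.1, attack_prob=0.3, attack_size=5.0)
a_l1 = l1_estimate(xs)
a_ls, n_clean = filter_and_least_squares(xs, a_l1, threshold=1.0)
```

Here the attacks are much larger than the noise, so the threshold separates attacked and clean transitions, which corresponds to the perfectly separable case mentioned above.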
A. Trudeau, Anthony H. Gonzalez, K. Thongkham, M. Brodwin, Thomas Connor, Peter R. M. Eisenhardt, Emily Moravec, S. A. Stanford et al.
The splashback radius, the radius of the apocenter of the first orbit of infalling material, is a measurable quantity marking the boundary between a galaxy cluster and its infalling region. We report detections of splashback radii in total light stacks, i.e. image stacks centered on the cores of galaxy clusters. Our analysis uses Wide-field Infrared Survey Explorer (WISE) W1 and W2 images of 83,345 candidate clusters at $0.5 \lesssim z \lesssim 1.9$ from the Massive and Distant Clusters of WISE Survey 2 (MaDCoWS2). The clusters are organized in stacks by redshift and signal-to-noise ($S\slash N$) ratios. We adopt a statistical approach, using 1000 bootstrap realizations to determine the median projected splashback radius and its confidence interval in a given bin. We compare our splashback radii with the measurements made by K. Thongkham et al. on a similar sample of MaDCoWS2 clusters using galaxy-cluster cross-correlation and find that they are consistent, although our method yields larger error bars. Our main systematic error is the accuracy of the background subtraction, but its impact remains small: the consistency of K. Thongkham et al. and our results suggests that neither method suffers from large systematics. The sensitivity of total light stacking to the contribution of faint galaxies can be advantageous to locate splashback radii when only the brightest galaxies are detected in individual images, such as at high redshifts. We present a potential application of this new technique to probe the evolution of the stellar mass in cluster infalling regions.
Authors' comments: Accepted by ApJ; 16 pages, 6 figures, 3 tables
Khunanon Thongkham, Anthony H. Gonzalez, Mark Brodwin, Ariane Trudeau, Peter Eisenhardt, S. A. Stanford, Emily Moravec, Thomas Connor et al.
The Massive and Distant Clusters of WISE Survey 2 (MaDCoWS2) is a WISE-selected catalog of galaxy clusters at $0.1<z<2$ covering an effective area of $>6000$ deg$^2$. In this paper, we derive splashback radii for this cluster ensemble from galaxy density profiles and constrain the mass threshold of the survey as a function of redshift. We use MaDCoWS2 cluster candidates at $0.4\leq z \leq 1.65$ divided into subsamples with different signal-to-noise (S/N$_{\rm P}$) and redshifts, cross-correlated with galaxies from the CatWISE2020 catalog, to obtain average surface density profiles. We perform a Markov Chain Monte Carlo analysis to derive parameter estimates for theoretical models consisting of orbiting and infalling terms. A distinct splashback feature is detected in all subsamples. The measured splashback radii span from $0.89^{+0.02}_{-0.02}h^{-1}$ comoving Mpc/cMpc ($0.61^{+0.02}_{-0.02}h^{-1}$ proper Mpc/pMpc) at $\overline{z}=0.45$ to $1.27^{+0.05}_{-0.05}h^{-1}$ cMpc ($0.53^{+0.04}_{-0.04}h^{-1}$ pMpc) at $\overline{z}=1.54$. We also find that splashback radii increase with $S/N_{\rm P}$ at fixed redshift. The resultant splashback radii constrain the redshift dependence of the mass of MaDCoWS2 clusters at fixed $S/N_{\rm P}$. We calculate $M_{\rm 200m}$ from the radii using a relation based on a cosmological simulation. MaDCoWS2 $M_{\rm 200m}$ values derived from the simulation-based relation are lower than the expected values based on weak-lensing observations. More robust mass constraints will come from calibrating splashback radii derived from galaxy density profiles with weak lensing shear profiles from facilities such as $\textit{Euclid}$, Rubin, and $\textit{Roman}$.
Authors' comments: 22 pages, 7 figures, 9 tables, submitted to ApJ
Lei Deng, Wenhao Huang, Chao Yang, Haoyuan Zheng, Yinbin Tian, Yue Ma
Defect depth quantification in additively manufactured (AM) components remains a significant challenge for non-destructive testing (NDT). This study proposes a Pixel-wise Quantitative Thermography Neural Network (PQT-Net) to address this challenge for polylactic acid (PLA) parts. A key innovation is a novel data augmentation strategy that reconstructs thermal sequence data into two-dimensional stripe images, preserving the complete temporal evolution of heat diffusion for each pixel. The PQT-Net architecture incorporates a pre-trained EfficientNetV2-S backbone and a custom Residual Regression Head (RRH) with learnable parameters to refine outputs. Comparative experiments demonstrate the superiority of PQT-Net over other deep learning models, achieving a minimum Mean Absolute Error (MAE) of 0.0094 mm and a coefficient of determination (R^2) exceeding 99%. The high precision of PQT-Net underscores its potential for robust quantitative defect characterization in AM.
Authors' comments: Under review
Adel Javanmard, David P. Woodruff
The Courtade-Kumar conjecture posits that dictatorship functions maximize the mutual information between the function's output and a noisy version of its input over the Boolean hypercube. We present two significant advancements related to this conjecture. First, we resolve an open question posed by Courtade and Kumar, proving that for any Boolean function (regardless of bias), the sum of mutual information between the function's output and the individual noisy input coordinates is bounded by $1-H(α)$, where $α$ is the noise parameter of the Binary Symmetric Channel. This generalizes their previous result which was restricted to balanced Boolean functions. Second, we advance the study of the main conjecture in the high noise regime. We establish an optimal error bound of $O(λ^2)$ for the asymptotic entropy expansion, where $λ= (1-2α)^2$, improving upon the previous best-known bounds. This refined analysis leads to a sharp, linear Fourier concentration bound for highly informative functions and significantly extends the range of the noise parameter $λ$ for which the conjecture is proven to hold.
Authors' comments: 16 pages
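The quantity $1-H(α)$ is the mutual information attained by a dictatorship function f(x) = x_1 on uniform inputs observed through a Binary Symmetric Channel, since only the first noisy coordinate carries information about x_1. A short numerical sketch follows; the function names are illustrative.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def dictator_mutual_information(alpha):
    """I(f(X); Y) for f(x) = x_1, X uniform, Y = X through a BSC(alpha)."""
    return 1.0 - binary_entropy(alpha)

def noise_lambda(alpha):
    """The parameter lambda = (1 - 2*alpha)^2 from the abstract."""
    return (1.0 - 2.0 * alpha) ** 2

for alpha in (0.0, 0.11, 0.25, 0.5):
    print(alpha, dictator_mutual_information(alpha), noise_lambda(alpha))
```

As α approaches 0.5 (the high noise regime), both the mutual information and λ tend to zero.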
Tomoki Kubo, Ryuken Uda, Yusuke Iida
Deep double descent is one of the key phenomena underlying the generalization capability of deep learning models. In this study, epoch-wise double descent, which is delayed generalization following overfitting, was empirically investigated by focusing on the evolution of internal structures. Fully connected neural networks of three different sizes were trained on the CIFAR-10 dataset with 30% label noise. By decomposing the loss curves into signal contributions from clean and noisy training data, the epoch-wise evolutions of internal signals were analyzed separately. Three main findings were obtained from this analysis. First, the model achieved strong re-generalization on test data even after perfectly fitting noisy training data during the double descent phase, corresponding to a "benign overfitting" state. Second, noisy data were learned after clean data, and as learning progressed, their corresponding internal activations became increasingly separated in outer layers; this enabled the model to overfit only noisy data. Third, a single, very large activation emerged in the shallow layer across all models; this phenomenon is referred to as "outliers," "massive activations," and "super activations" in recent large language models, and it evolves with re-generalization. The magnitude of the large activation correlated with input patterns but not with output patterns. These empirical findings directly link the recent key phenomena of "deep double descent," "benign overfitting," and "large activation," and support the proposal of a novel scenario for understanding deep double descent.
Authors' comments: 17 pages, 9 figures
Prasanjit Dubey, Xiaoming Huo
Identifying the most powerful test in multiple hypothesis testing under strong family-wise error rate (FWER) control is a fundamental problem in statistical methodology. State-of-the-art approaches formulate this as a constrained optimisation problem, for which a dual problem with strong duality has been established in a general sense. However, a constructive method for solving the dual problem is lacking, leaving a significant computational gap. This paper fills this gap by deriving novel, necessary optimality conditions for the dual optimisation. We show that these conditions motivate an efficient coordinate-wise algorithm for computing the optimal dual solution, which, in turn, provides the most powerful test for the primal problem. We prove the linear convergence of our algorithm, i.e., the computational complexity of our proposed algorithm is proportional to the logarithm of the reciprocal of the target error. To the best of our knowledge, this is the first time such a fast and computationally efficient algorithm has been proposed for finding the most powerful test with family-wise error rate control. The method's superior power is demonstrated through simulation studies, and its practical utility is shown by identifying new, significant findings in both clinical and financial data applications.
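The link between linear convergence and the stated complexity can be spelled out in one line. The symbols $e_k$ (error after $k$ iterations), $C$, $\rho$, and $\varepsilon$ are introduced here for illustration and are not taken from the paper.

```latex
e_k \le C\rho^{k},\quad 0<\rho<1
\quad\Longrightarrow\quad
e_k \le \varepsilon \ \text{ whenever } \
k \ \ge\ \frac{\log(C/\varepsilon)}{\log(1/\rho)} \;=\; \mathcal{O}\!\left(\log\frac{1}{\varepsilon}\right).
```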
Hossein Teimoori Faal, Hasan Khodakarami
We study the Pascal determinantal arrays $\PD_k$, whose entries $\PD_k(i,j)$ are the $k\times k$ minors of the lower-triangular Pascal matrix $P=( \binom{a}{b} )_{a,b\ge 0}$. We prove an exact factorization of the row-wise log-concavity operator: \[ \LC(\PD_k)=\PD_{k-1}\Had\PD_{k+1}, \] where $\LC(a)_j=a_j^2-a_{j-1}a_{j+1}$ and $\Had$ denotes the Hadamard (entrywise) product. This identity is established by an elementary manipulation of the Desnanot--Jacobi (Dodgson) identity in two adjacent positions. We further prove a general inequality asserting that the log-concavity operator is submultiplicative under Hadamard products of log-concave arrays: $\LC(A\Had X)\ge\LC(A)\Had\LC(X)$. Combining the factorization with this inequality yields a uniform algebraic proof that every row of every array $\PD_k$ ($k\ge 1$) is infinitely log-concave, extending the celebrated theorem of McNamara and Sagan from Pascal's triangle ($\PD_1$) to the entire determinantal hierarchy. Applications include the log-convexity of $\{\PD_k(i,j)\}_{k\ge 0}$ in the determinantal order $k$ and a family of determinantal Hadamard inequalities.
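The log-concavity operator itself is easy to experiment with. The sketch below applies $\LC$ to rows of Pascal's triangle (the $\PD_1$ case) and checks that a few iterates stay nonnegative. It assumes the usual zero-padding convention $a_{-1}=a_{n+1}=0$ at the row ends, which the abstract does not state, and it does not implement the determinantal arrays $\PD_k$ for $k\ge 2$. Function names are illustrative.

```python
from math import comb

def lc(row):
    """Apply LC(a)_j = a_j^2 - a_{j-1}*a_{j+1}, treating out-of-range entries as 0."""
    padded = [0] + list(row) + [0]
    return [padded[j] ** 2 - padded[j - 1] * padded[j + 1]
            for j in range(1, len(padded) - 1)]

def pascal_row(i):
    """Row i of Pascal's triangle: binom(i, 0), ..., binom(i, i)."""
    return [comb(i, j) for j in range(i + 1)]

def stays_nonnegative(row, iterations):
    """Check that the first few iterates of LC have no negative entries."""
    for _ in range(iterations):
        row = lc(row)
        if min(row) < 0:
            return False
    return True

first = lc(pascal_row(4))   # one application to 1, 4, 6, 4, 1
second = lc(first)          # a second application
checked = all(stays_nonnegative(pascal_row(i), 5) for i in range(10))
```

A finite check like this is only a sanity test; infinite log-concavity requires every iterate to be nonnegative, which is what the algebraic proof establishes.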
Barbara Steffen, Edward A. Lee, Moshe Y. Vardi, Bernhard Steffen
Artificial intelligence (AI) is no longer futuristic; it is a daily companion shaping our private and work lives. While AI simplifies our lives, its rise also invites us to rethink who we are - and who we wish to remain - as humans. Even if AI does not think, feel, or desire, it learns from our behavior, mirroring our collective values, biases, and aspirations. The question, then, is not what AI is, but what we are allowing it to become through data, computing power, and other parameters "teaching" it - and, even more importantly, who we are becoming through our relationship with AI. As the EU AI Act and the Vienna Manifesto on Digital Humanism emphasize, technology must serve human dignity, social well-being, and democratic accountability. In our opinion, responsible use of AI is not only a matter of code or law, but also of conscientious practice: how each of us engages and teaches others to use AI at home and at work. The Ten Commandments for the Wise and Responsible Use of AI that we propose are meant as a guideline for this very engagement. They closely align with Floridi and Cowls' five guiding principles for AI in society - beneficence, non-maleficence, autonomy, justice, and explicability.
Anuab Sen, Mir Sayeed Mohammad, Saibal Mukhopadhyay
We introduce SSMRadNet, the first multi-scale State Space Model (SSM) based detector for Frequency Modulated Continuous Wave (FMCW) radar that sequentially processes raw ADC samples through two SSMs. One SSM learns a chirp-wise feature by sequentially processing samples from all receiver channels within one chirp, and a second SSM learns a representation of a frame by sequentially processing chirp-wise features. The latent representations of a radar frame are decoded to perform segmentation and detection tasks. Comprehensive evaluations on the RADIal dataset show SSMRadNet has 10-33x fewer parameters and 60-88x less computation (GFLOPs) while being 3.7x faster than state-of-the-art transformer and convolution-based radar detectors at competitive performance for segmentation tasks.
Tsung-En Lin, Kuan-Yi Lee, Hung-Yi Lee
Large Audio-Language Models and Multi-Modal Large Language Models have demonstrated strong capabilities in tasks such as Audio Question Answering (AQA), Audio Captioning, and Automatic Speech Recognition (ASR). However, there is growing evidence that these models can hallucinate about the content of the audio. To address this issue, we probe the models' internal states and propose Adaptive Vector Steering (AVS), a method that better grounds generation in audio content. We also identify a strong correlation between output correctness and internal representations. Experiments show consistent performance gains across two models and two benchmarks. On the Audio Hallucination QA dataset, our method boosts the F1-score of Gemma from 0.550 to 0.619 and Qwen from 0.626 to 0.632. Furthermore, our method increases the accuracy of Qwen on MMAU from 0.548 to 0.592, marking an 8% relative increase. To the best of our knowledge, this is the first work to apply vector steering to mitigate hallucination in audio.
Authors' comments: Note: This preprint is a version of the paper submitted to ICASSP 2026. The author list here includes contributors who provided additional supervision and guidance. The official ICASSP submission may differ slightly in author composition