Pierre Lelièvre, Chien-Chung Chen
Attribution methods are primarily designed to study the distribution of input
component contributions to individual model predictions. However, some research
applications require a summary of attribution patterns across the entire
dataset to facilitate the interpretability of the scrutinized models. In this
paper, we present a new method called Integrated Gradient Correlation (IGC)
that relates dataset-wise attributions to a model prediction score and enables
region-specific analysis by a direct summation over associated components. We
demonstrate our method on scalar predictions with the study of image feature
representation in the brain from fMRI neural signals and the estimation of
neural population receptive fields (NSD dataset), as well as on categorical
predictions with the investigation of handwritten digit recognition (MNIST
dataset). The resulting IGC attributions show selective patterns, revealing
underlying model strategies coherent with their respective objectives.
Authors' comments: 12 pages, 8 figures, source code at
https://github.com/plelievre/int_grad_corr.git
Wencheng Zhu, Xin Zhou, Pengfei Zhu, Yu Wang, Qinghua Hu
In this paper, we propose a simple yet effective contrastive knowledge distillation framework that achieves sample-wise logit alignment while preserving semantic consistency. Conventional knowledge distillation approaches exhibit over-reliance on feature similarity per sample, which risks overfitting, and contrastive approaches focus on inter-class discrimination at the expense of intra-sample semantic relationships. Our approach transfers "dark knowledge" through teacher-student contrastive alignment at the sample level. Specifically, our method first enforces intra-sample alignment by directly minimizing teacher-student logit discrepancies within individual samples. Then, we utilize inter-sample contrasts to preserve semantic dissimilarities across samples. By redefining positive pairs as aligned teacher-student logits from identical samples and negative pairs as cross-sample logit combinations, we reformulate these dual constraints into an InfoNCE loss framework, reducing computational complexity to below the square of the number of samples while eliminating dependencies on temperature parameters and large batch sizes. We conduct comprehensive experiments across three benchmark datasets, including the CIFAR-100, ImageNet-1K, and MS COCO datasets, and experimental results clearly confirm the effectiveness of the proposed method on image classification, object detection, and instance segmentation tasks.
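The pairing described in this abstract can be illustrated with a minimal sketch of an InfoNCE-style loss over logits. It treats the student and teacher logits of the same sample as the positive pair and every cross-sample combination as a negative. The function names, the L2 normalisation, and the use of a dot product as the similarity are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """L2-normalise a vector (an assumption; guards against a zero norm)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def contrastive_logit_loss(student_logits, teacher_logits):
    """InfoNCE-style loss averaged over the batch.

    Positive pair: (student_i, teacher_i), logits of the same sample.
    Negative pairs: (student_i, teacher_j) for j != i.
    No temperature parameter is used in this sketch.
    """
    s = [normalize(row) for row in student_logits]
    t = [normalize(row) for row in teacher_logits]
    total = 0.0
    for i, s_i in enumerate(s):
        sims = [dot(s_i, t_j) for t_j in t]
        # Numerically stable log-sum-exp over all teacher rows.
        m = max(sims)
        log_denom = m + math.log(sum(math.exp(x - m) for x in sims))
        total += log_denom - sims[i]
    return total / len(s)
```

In this sketch the loss is smallest when each student row is most similar to its own teacher row, so a batch whose student logits are paired with the wrong teacher rows scores higher than a correctly aligned batch.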
Marco Berrettini, Christian Hennig, Cinzia Viroli
Quantile-based classifiers can classify high-dimensional observations by minimising a discrepancy of an observation to a class based on suitable quantiles of the within-class distributions, corresponding to a unique percentage for all variables. The present work extends these classifiers by introducing a way to determine potentially different optimal percentages for different variables. Furthermore, a variable-wise scale parameter is introduced. A simple greedy algorithm to estimate the parameters is proposed. Their consistency in a nonparametric setting is proved. Experiments using artificially generated and real data confirm the potential of the quantile-based classifier with variable-wise parameters.
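As a minimal sketch of the idea, the following code assigns an observation to the class with the smallest summed quantile discrepancy, using a separate percentage and an optional scale weight for each variable. The check-loss form of the discrepancy, the linear-interpolation quantile estimate, and all function names are assumptions for illustration. The greedy parameter estimation proposed in the paper is not reproduced here, so the percentages are simply passed in.

```python
def empirical_quantile(values, theta):
    """Empirical quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = theta * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def check_loss(u, theta):
    """Asymmetric quantile (check) loss; non-negative for theta in [0, 1]."""
    return u * (theta - (1.0 if u < 0 else 0.0))


def fit_class_quantiles(train, thetas):
    """train maps label -> list of rows; returns label -> per-variable quantiles."""
    return {
        label: [
            empirical_quantile([row[j] for row in rows], theta)
            for j, theta in enumerate(thetas)
        ]
        for label, rows in train.items()
    }


def classify(x, class_quantiles, thetas, scales=None):
    """Pick the class whose variable-wise quantiles are closest to x."""
    scales = scales or [1.0] * len(thetas)

    def discrepancy(quantiles):
        return sum(
            w * check_loss(xj - q, theta)
            for xj, q, theta, w in zip(x, quantiles, thetas, scales)
        )

    return min(class_quantiles, key=lambda label: discrepancy(class_quantiles[label]))
```

A typical call first fits the class quantiles with `fit_class_quantiles(train, thetas)` and then passes the result to `classify` together with the same `thetas`.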
Khoi Do, Duong Nguyen, Nguyen H. Tran, Viet Dung Nguyen
Beyond class frequency, we recognize the impact of class-wise relationships among various class-specific predictions and the imbalance in label masks on long-tailed segmentation learning. To address these challenges, we propose an innovative Pixel-wise Adaptive Training (PAT) technique tailored for long-tailed segmentation. PAT has two key features: 1) class-wise gradient magnitude homogenization, and 2) pixel-wise class-specific loss adaptation (PCLA). First, the class-wise gradient magnitude homogenization helps alleviate the imbalance among label masks by ensuring equal consideration of the class-wise impact on model updates. Second, PCLA tackles the detrimental impact of both rare classes within the long-tailed distribution and inaccurate predictions from previous training stages by encouraging learning classes with low prediction confidence and guarding against forgetting classes with high confidence. This combined approach fosters robust learning while preventing the model from forgetting previously learned knowledge. PAT exhibits significant performance improvements, surpassing the current state-of-the-art by 2.2% on the NYU dataset. Moreover, it enhances overall pixel-wise accuracy by 2.85% and intersection over union value by 2.07%, with a particularly notable declination of 0.39% in detecting rare classes compared to Balance Logits Variation, as demonstrated on the three popular datasets, i.e., OxfordPetIII, CityScape, and NYU.
Francis Duey, James Schombert, Stacy McGaugh, Federico Lelli
We present WISE W1 photometry of the SPARC (Spitzer Photometry and Accurate
Rotation Curves) sample. The baseline of near-IR fluxes is established for use
by stellar mass models, a key component of the baryonic Tully-Fisher relation
and other kinematic galaxy scaling relations. We focus this paper on
determination of the characteristics of the W1 fluxes compared to IRAC 3.6
fluxes, internal accuracy limitations from photometric techniques, external
accuracy by comparison to other work in the literature and the range of W1 to
IRAC 3.6 colors. We outline the behavior of SDSS g, W1 and IRAC 3.6 colors with
respect to underlying SED features. We also note a previously unknown
correlation between WISE colors and the central surface brightness, probably
related to the low metallicity of low surface brightness dwarfs.
Authors' comments: Accepted to AJ, 19 pages, 10 figures
Tianyu Huang, Liangzu Peng, René Vidal, Yun-Hui Liu
Given an input set of $3$D point pairs, the goal of outlier-robust $3$D
registration is to compute some rotation and translation that align as many
point pairs as possible. This is an important problem in computer vision, for
which many highly accurate approaches have been recently proposed. Despite
their impressive performance, these approaches lack scalability, often
overflowing the $16$GB of memory of a standard laptop to handle roughly
$30,000$ point pairs. In this paper, we propose a $3$D registration approach
that can process more than ten million ($10^7$) point pairs with over $99\%$
random outliers. Moreover, our method is efficient, entails low memory costs,
and maintains high accuracy at the same time. We call our method TEAR, as it
involves minimizing an outlier-robust loss that computes Truncated Entry-wise
Absolute Residuals. To minimize this loss, we decompose the original
$6$-dimensional problem into two subproblems of dimensions $3$ and $2$,
respectively, solved in succession to global optimality via a customized
branch-and-bound method. While branch-and-bound is often slow and unscalable,
this does not apply to TEAR as we propose novel bounding functions that are
tight and computationally efficient. Experiments on various datasets are
conducted to validate the scalability and efficiency of our method.
Authors' comments: 24 pages, 12 figures. Accepted to CVPR 2024
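One plausible reading of "Truncated Entry-wise Absolute Residuals" is sketched below: each coordinate of the residual between a transformed source point and its target is taken in absolute value and capped at a threshold, so a gross outlier adds at most a bounded amount to the loss. The exact form of the loss, the single shared threshold, and the function names are assumptions for illustration. The sketch only evaluates the loss for a given rotation and translation and does not attempt the branch-and-bound solver.

```python
def apply_rigid(rotation, translation, point):
    """Apply a 3x3 rotation (list of rows) and a translation to a 3D point."""
    return [
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    ]


def tear_loss(rotation, translation, pairs, threshold):
    """Sum of per-coordinate absolute residuals, each capped at `threshold`."""
    total = 0.0
    for source, target in pairs:
        moved = apply_rigid(rotation, translation, source)
        total += sum(min(abs(m - q), threshold) for m, q in zip(moved, target))
    return total
```

With the identity transform, a perfectly matched pair contributes zero, while an outlier pair contributes at most three times the threshold however far off it is.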
Mei Qiu, Wei Lin, Lauren Ann Christopher, Stanley Chien, Yaobin Chen, Shu Hu
In the US, thousands of Pan, Tilt, and Zoom (PTZ) traffic cameras monitor highway conditions. There is a great interest in using these highway cameras to gather valuable road traffic data to support traffic analysis and decision-making for highway safety and efficient traffic management. However, there are too many cameras for a few human traffic operators to effectively monitor, so a fully automated solution is desired. This paper introduces a novel system that learns the locations of highway lanes and traffic directions from these camera feeds automatically. It collects real-time, lane-specific traffic data continuously, even adjusting for changes in camera angle or zoom. This facilitates efficient traffic analysis, decision-making, and improved highway safety.
Ningyi Liao, Zihao Yu, Siqiang Luo
Graph Neural Networks (GNNs) have shown promising performance in various graph learning tasks, but at the cost of resource-intensive computations. The primary overhead of GNN update stems from graph propagation and weight transformation, both involving operations on graph-scale matrices. Previous studies attempt to reduce the computational budget by leveraging graph-level or network-level sparsification techniques, resulting in downsized graph or weights. In this work, we propose Unifews, which unifies the two operations in an entry-wise manner considering individual matrix elements, and conducts joint edge-weight sparsification to enhance learning efficiency. The entry-wise design of Unifews enables adaptive compression across GNN layers with progressively increased sparsity, and is applicable to a variety of architectural designs with on-the-fly operation simplification. Theoretically, we establish a novel framework to characterize sparsified GNN learning in view of a graph optimization process, and prove that Unifews effectively approximates the learning objective with bounded error and reduced computational load. We conduct extensive experiments to evaluate the performance of our method in diverse settings. Unifews is advantageous in jointly removing more than 90% of edges and weight entries with comparable or better accuracy than baseline models. The sparsification offers remarkable efficiency improvements including 10-20x matrix operation reduction and up to 100x acceleration in graph propagation time for the largest graph at the billion-edge scale.
Jiangshan Wang, Yifan Pu, Yizeng Han, Jiayi Guo, Yiru Wang, Xiu Li, Gao Huang
Oriented object detection, an emerging task in recent years, aims to identify
and locate objects across varied orientations. This requires the detector to
accurately capture the orientation information, which varies significantly
within and across images. Despite the existing substantial efforts,
simultaneously ensuring model effectiveness and parameter efficiency remains
challenging in this scenario. In this paper, we propose a lightweight yet
effective Group-wise Rotating and Attention (GRA) module to replace the
convolution operations in backbone networks for oriented object detection. GRA
can adaptively capture fine-grained features of objects with diverse
orientations, comprising two key components: Group-wise Rotating and Group-wise
Attention. Group-wise Rotating first divides the convolution kernel into
groups, where each group extracts different object features by rotating at a
specific angle according to the object orientation. Subsequently, Group-wise
Attention is employed to adaptively enhance the object-related regions in the
feature. The collaborative effort of these components enables GRA to
effectively capture the various orientation information while maintaining
parameter efficiency. Extensive experimental results demonstrate the
superiority of our method. For example, GRA achieves a new state-of-the-art
(SOTA) on the DOTA-v2.0 benchmark, while saving the parameters by nearly 50%
compared to the previous SOTA method. Code will be released.
Authors' comments: tech report
Farhad Pakdaman, Moncef Gabbouj
The emerging Learned Compression (LC) replaces the traditional codec modules with Deep Neural Networks (DNN), which are trained end-to-end for rate-distortion performance. This approach is considered the future of image/video compression, and major efforts have been dedicated to improving its compression efficiency. However, most proposed works target compression efficiency by employing more complex DNNs, which contributes to higher computational complexity. Alternatively, this paper proposes to improve compression by fully exploiting the existing DNN capacity. To do so, the latent features are guided to learn a richer and more diverse set of features, which corresponds to better reconstruction. A channel-wise feature decorrelation loss is designed and is integrated into the LC optimization. Three strategies are proposed and evaluated, which optimize (1) the transformation network, (2) the context model, and (3) both networks. Experimental results on two established LC methods show that the proposed method improves the compression with a BD-Rate of up to 8.06%, with no added complexity. The proposed solution can be applied as a plug-and-play solution to optimize any similar LC method.
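A channel-wise decorrelation penalty can take several forms, and the abstract does not give the exact loss. The sketch below assumes one simple variant: the mean squared Pearson correlation over all pairs of flattened latent channels, which is zero when the channels are pairwise uncorrelated. The function names and this particular formula are illustrative assumptions rather than the loss designed in the paper.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def decorrelation_loss(channels):
    """Mean squared correlation over all distinct channel pairs.

    `channels` is a list of flattened feature channels of equal length.
    """
    count = len(channels)
    if count < 2:
        return 0.0
    total = 0.0
    for i in range(count):
        for j in range(i + 1, count):
            total += pearson(channels[i], channels[j]) ** 2
    return total / (count * (count - 1) / 2)
```

In training, a term like this would be added to the rate-distortion objective with a weighting factor, which is also an assumption here.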
Odin Zhang, Yufei Huang, Shichen Cheng, Mengyao Yu, Xujun Zhang, Haitao Lin, Yundian Zeng, Mingyang Wang et al.
Most earlier 3D structure-based molecular generation approaches follow an atom-wise paradigm, incrementally adding atoms to a partially built molecular fragment within protein pockets. These methods, while effective in designing tightly bound ligands, often overlook other essential properties such as synthesizability. The fragment-wise generation paradigm offers a promising solution. However, a common challenge across both atom-wise and fragment-wise methods lies in their limited ability to co-design plausible chemical and geometrical structures, resulting in distorted conformations. In response to this challenge, we introduce the Deep Geometry Handling protocol, a more abstract design that extends the design focus beyond the model architecture. Through a comprehensive review of existing geometry-related models and their protocols, we propose a novel hybrid strategy, culminating in the development of FragGen - a geometry-reliable, fragment-wise molecular generation method. FragGen marks a significant leap forward in the quality of generated geometry and the synthesis accessibility of molecules. The efficacy of FragGen is further validated by its successful application in designing type II kinase inhibitors at the nanomolar level.
Hayeon O, Chanuk Yang, Kunsoo Huh
In autonomous driving, 3D object detection provides more precise information
for downstream tasks, including path planning and motion estimation, compared
to 2D object detection. In this paper, we propose SeSame: a method aimed at
enhancing semantic information in existing LiDAR-only based 3D object
detection. This addresses the limitation of existing 3D detectors, which
primarily focus on object presence and classification, thus lacking in
capturing relationships between elemental units that constitute the data, akin
to semantic segmentation. Experiments demonstrate the effectiveness of our
method with performance improvements on the KITTI object detection benchmark.
Our code is available at https://github.com/HAMA-DL-dev/SeSame
Authors' comments: 17 pages, 4 figures
Jialin Chen, Zhiqiang Cai, Ke Xu, Di Wu, Wei Cao
Considering the noise level limit, one crucial aspect for quantum machine learning is to design a high-performing variational quantum circuit architecture with a small number of quantum gates. Like classical neural architecture search (NAS), quantum architecture search (QAS) methods employ techniques such as reinforcement learning, evolutionary algorithms, and supernet optimization to improve search efficiency. In this paper, we propose a novel qubit-wise architecture search (QWAS) method, which progressively searches a one-qubit configuration per stage and combines with the Monte Carlo Tree Search algorithm to find good quantum architectures by partitioning the search space into several good and bad subregions. The numerical experimental results indicate that our proposed method can balance the exploration and exploitation of circuit performance and size in some real-world tasks, such as MNIST, Fashion and MOSI. As far as we know, QWAS achieves state-of-the-art results on all tasks in terms of accuracy and circuit size.
Yameng Peng, Andy Song, Haytham M. Fayek, Vic Ciesielski, Xiaojun Chang
Training-free metrics (a.k.a. zero-cost proxies) are widely used to avoid
resource-intensive neural network training, especially in Neural Architecture
Search (NAS). Recent studies show that existing training-free metrics have
several limitations, such as limited correlation and poor generalisation across
different search spaces and tasks. Hence, we propose Sample-Wise Activation
Patterns and its derivative, SWAP-Score, a novel high-performance training-free
metric. It measures the expressivity of networks over a batch of input samples.
The SWAP-Score is strongly correlated with ground-truth performance across
various search spaces and tasks, outperforming 15 existing training-free
metrics on NAS-Bench-101/201/301 and TransNAS-Bench-101. The SWAP-Score can be
further enhanced by regularisation, which leads to even higher correlations in
cell-based search space and enables model size control during the search. For
example, Spearman's rank correlation coefficient between regularised SWAP-Score
and CIFAR-100 validation accuracies on NAS-Bench-201 networks is 0.90,
significantly higher than 0.80 from the second-best metric, NWOT. When
integrated with an evolutionary algorithm for NAS, our SWAP-NAS achieves
competitive performance on CIFAR-10 and ImageNet in approximately 6 minutes and
9 minutes of GPU time respectively.
Authors' comments: ICLR2024 Spotlight
Shiwen Ni, Min Yang, Ruifeng Xu, Chengming Li, Xiping Hu
Among the various pre-trained neural language models that are popular today, dropout is already an indispensable regularization technique. To solve the inconsistency between training and inference caused by the randomness of dropout, some studies use consistency training to regularize dropout at the output layer. In this paper, we propose a novel Layer-wise Regularized Dropout (LR-Drop), which is specially designed for Transformer-based Language models. Specifically, LR-Drop layer-wise regularizes each Transformer layer using the consistency training strategy. Each training sample passes through the two siamese sub-models sampled by dropout, and then LR-Drop forces the hidden states, multi-head attention matrices, and output distribution of the two siamese sub-models to be consistent. The proposed LR-Drop can be regarded as a "self-distillation" framework, in which each sub-model generated by dropout is the other's "teacher" model and "student" model. Through extensive experiments on 8 natural language understanding datasets, 6 neural machine translation datasets, and 1 abstractive summarization dataset (a total of 15 datasets), we show that LR-Drop achieves superior performances, including state-of-the-art results.
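The consistency objective described above can be sketched as follows, assuming two forward passes of the same sample under different dropout masks. Each pass is represented by a hypothetical dictionary holding per-layer hidden states and output logits. Hidden states are compared with a mean squared error and output distributions with a symmetric KL divergence. These distance choices, the weights `alpha` and `beta`, and the data layout are assumptions; attention matrices could be handled in the same way as the hidden states.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def consistency_loss(pass_a, pass_b, alpha=1.0, beta=1.0):
    """Layer-wise consistency between two dropout-sampled sub-models."""
    # One MSE term per layer, summed over layers.
    hidden = sum(mse(ha, hb) for ha, hb in zip(pass_a["hidden"], pass_b["hidden"]))
    # Symmetric KL between the two output distributions.
    p, q = softmax(pass_a["logits"]), softmax(pass_b["logits"])
    output = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return alpha * hidden + beta * output
```

The loss is zero when both passes agree exactly and grows as the two sub-models diverge, which matches the "each is the other's teacher" reading of the abstract.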
Haoming Li, Yusen Huo, Shuai Dou, Zhenzhe Zheng, Zhilin Zhang, Chuan Yu, Jian Xu, Fan Wu
In online advertising, advertisers participate in ad auctions to acquire ad
opportunities, often by utilizing auto-bidding tools provided by demand-side
platforms (DSPs). The current auto-bidding algorithms typically employ
reinforcement learning (RL). However, due to safety concerns, most RL-based
auto-bidding policies are trained in simulation, leading to a performance
degradation when deployed in online environments. To narrow this gap, we can
deploy multiple auto-bidding agents in parallel to collect a large interaction
dataset. Offline RL algorithms can then be utilized to train a new policy. The
trained policy can subsequently be deployed for further data collection,
resulting in an iterative training framework, which we refer to as iterative
offline RL. In this work, we identify the performance bottleneck of this
iterative offline RL framework, which originates from the ineffective
exploration and exploitation caused by the inherent conservatism of offline RL
algorithms. To overcome this bottleneck, we propose Trajectory-wise Exploration
and Exploitation (TEE), which introduces a novel data collecting and data
utilization method for iterative offline RL from a trajectory perspective.
Furthermore, to ensure the safety of online exploration while preserving the
dataset quality for TEE, we propose Safe Exploration by Adaptive Action
Selection (SEAS). Both offline experiments and real-world experiments on
Alibaba display advertising platform demonstrate the effectiveness of our
proposed method.
Authors' comments: Accepted by The Web Conference 2024 (WWW'24) as an oral paper
Kei Nakatsuru, Seiichi Uchida
Kerning is the task of setting appropriate horizontal spaces for all possible letter pairs of a certain font. One of the difficulties of kerning is that the appropriate space differs for each letter pair. Therefore, for a total of 52 capital and small letters, we need to adjust $52 \times 52 = 2704$ different spaces. Another difficulty is that there is neither a general procedure nor criterion for automatic kerning; therefore, kerning is still done manually or with heuristics. In this paper, we tackle kerning by proposing two machine-learning models, called pairwise and set-wise models. The former is a simple deep neural network that estimates the letter space for two given letter images. In contrast, the latter is a Transformer-based model and estimates the letter spaces for three or more given letter images. For example, the set-wise model simultaneously estimates 2704 spaces for 52 letter images for a certain font. Among the two models, the set-wise model is not only more efficient but also more accurate because its internal self-attention mechanism allows for more consistent kerning for all letters. Experimental results on about 2500 Google fonts and their quantitative and qualitative analyses show that the set-wise model has an average estimation error of only about 5.3 pixels when the average letter space of all fonts and letter pairs is about 115 pixels.
Song Guo, Fan Wu, Lei Zhang, Xiawu Zheng, Shengchuan Zhang, Fei Chao, Yiyu Shi, Rongrong Ji
Existing methods for fine-tuning sparse LLMs often suffer from resource-intensive requirements and high retraining costs. Additionally, many fine-tuning methods often rely on approximations or heuristic optimization strategies, which may lead to suboptimal solutions. To address these issues, we propose an efficient and fast framework for fine-tuning sparse LLMs based on minimizing reconstruction error. Our approach involves sampling a small dataset for calibration and utilizing backpropagation to iteratively optimize block-wise reconstruction error, on a block-by-block basis, aiming for optimal solutions. Extensive experiments on various benchmarks consistently demonstrate the superiority of our method over other baselines. For instance, on the Wikitext2 dataset with LlamaV1-7B at 70% sparsity, our proposed EBFT achieves a perplexity of 16.88, surpassing the state-of-the-art DSnoT with a perplexity of 75.14. Moreover, with a structured sparsity ratio of 26%, EBFT achieves a perplexity of 16.27, outperforming LoRA (perplexity 16.44). Furthermore, the fine-tuning process of EBFT for LlamaV1-7B only takes approximately 30 minutes, and the entire framework can be executed on a single 16GB GPU. The source code is available at https://github.com/sunggo/EBFT.
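The idea of minimizing block-wise reconstruction error on a small calibration set can be illustrated on a toy scale. The sketch below treats a single linear map as the "block", keeps a fixed sparsity mask, and runs plain gradient descent on the unpruned weights so that the sparse block reproduces the dense block's outputs on the calibration inputs. The single-layer block, the learning rate, the step count, and all names are assumptions for illustration and do not reflect the EBFT code.

```python
def forward(weights, x):
    """Output of a linear block: one dot product per weight row."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in weights]


def reconstruction_error(dense, sparse, calib):
    """Summed squared difference between dense and sparse block outputs."""
    err = 0.0
    for x in calib:
        err += sum(
            (yd - ys) ** 2 for yd, ys in zip(forward(dense, x), forward(sparse, x))
        )
    return err


def finetune_block(dense, mask, calib, lr=0.01, steps=200):
    """Gradient descent on the unmasked weights; pruned entries stay at zero."""
    sparse = [[w * m for w, m in zip(wr, mr)] for wr, mr in zip(dense, mask)]
    for _ in range(steps):
        grad = [[0.0] * len(row) for row in sparse]
        for x in calib:
            yd, ys = forward(dense, x), forward(sparse, x)
            for i, (d, s) in enumerate(zip(yd, ys)):
                for j, xj in enumerate(x):
                    # d/dW[i][j] of (d - s)^2 with s = sum_j W[i][j] * x[j]
                    grad[i][j] += -2.0 * (d - s) * xj
        for i, row in enumerate(sparse):
            for j in range(len(row)):
                # Multiplying by the mask keeps pruned weights fixed at zero.
                row[j] -= lr * grad[i][j] * mask[i][j]
    return sparse
```

In a multi-block model the same procedure would be repeated block by block, with each block's inputs taken from the calibration data propagated through the preceding blocks; that propagation scheme is again an assumption in this sketch.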
Zouying Cao, Yifei Yang, Hai Zhao
Large Language Models (LLMs) suffer from a huge number of parameters, which
restricts their deployment on edge devices. Weight sharing is one promising
solution that encourages weight reuse, effectively reducing memory usage with
less performance drop. However, current weight sharing techniques primarily
focus on small-scale models like BERT and employ coarse-grained sharing rules,
e.g., layer-wise. This becomes limiting given the prevalence of LLMs and
sharing an entire layer or block obviously diminishes the flexibility of weight
sharing. In this paper, we present a perspective on head-wise shareable
attention for large language models. We further propose two memory-efficient
methods that share parameters across attention heads, with a specific focus on
LLMs. Both of them use the same dynamic strategy to select the shared weight
matrices. The first method directly reuses the pre-trained weights without
retraining, denoted as $\textbf{DirectShare}$. The second method first
post-trains with constraint on weight matrix similarity and then shares,
denoted as $\textbf{PostShare}$. Experimental results reveal our head-wise
shared models still maintain satisfactory capabilities, demonstrating the
feasibility of fine-grained weight sharing applied to LLMs.
Authors' comments: 17 pages, 7 figures, 21 tables, EMNLP'24 Findings
Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Knowledge editing aims to rectify inaccuracies in large language models (LLMs) without costly retraining for outdated or erroneous knowledge. However, current knowledge editing methods primarily focus on single editing, failing to meet the requirements for lifelong editing. In this paper, lifelong editing is synonymous with lifelong knowledge editing. This study reveals a performance degradation encountered by knowledge editing in lifelong editing, characterized by toxicity buildup and toxicity flash, with the primary cause identified as pattern unmatch. We introduce a knowledge editing approach named WilKE, which selects the editing layer based on the pattern matching degree of editing knowledge across different layers. Experimental results demonstrate that, in lifelong editing, WilKE exhibits an average improvement of 46.2% and 67.8% on editing GPT2-XL and GPT-J relative to state-of-the-art knowledge editing methods.