Jiangshan Wang, Yifan Pu, Yizeng Han, Jiayi Guo, Yiru Wang, Xiu Li, Gao Huang
Oriented object detection, an emerging task in recent years, aims to identify
and locate objects across varied orientations. This requires the detector to
accurately capture the orientation information, which varies significantly
within and across images. Despite substantial existing efforts,
simultaneously ensuring model effectiveness and parameter efficiency remains
challenging in this scenario. In this paper, we propose a lightweight yet
effective Group-wise Rotating and Attention (GRA) module to replace the
convolution operations in backbone networks for oriented object detection. GRA
can adaptively capture fine-grained features of objects with diverse
orientations, comprising two key components: Group-wise Rotating and Group-wise
Attention. Group-wise Rotating first divides the convolution kernel into
groups, where each group extracts different object features by rotating at a
specific angle according to the object orientation. Subsequently, Group-wise
Attention is employed to adaptively enhance the object-related regions in the
feature. The collaborative effort of these components enables GRA to
effectively capture diverse orientation information while maintaining
parameter efficiency. Extensive experimental results demonstrate the
superiority of our method. For example, GRA achieves a new state-of-the-art
(SOTA) on the DOTA-v2.0 benchmark, while reducing the parameter count by nearly 50%
compared to the previous SOTA method. Code will be released.
Authors' comments: tech report
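The Group-wise Rotating step described above can be illustrated with a minimal sketch. It makes strong simplifying assumptions that are not from the abstract: rotations are restricted to multiples of 90 degrees (so no interpolation is needed), the per-group angles are passed in by hand rather than predicted from the input features, and the names `rot90` and `group_wise_rotate` are hypothetical.

```python
def rot90(kernel):
    """Rotate a square 2D kernel by 90 degrees counter-clockwise."""
    n = len(kernel)
    return [[kernel[j][n - 1 - i] for j in range(n)] for i in range(n)]


def group_wise_rotate(kernels, num_groups, quarter_turns):
    """Split `kernels` (one 2D kernel per channel) into contiguous groups and
    rotate every kernel in group g by quarter_turns[g] * 90 degrees.

    Assumption: angles are given; a real module would predict them.
    """
    if len(kernels) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    group_size = len(kernels) // num_groups
    rotated = []
    for idx, kernel in enumerate(kernels):
        turns = quarter_turns[idx // group_size] % 4
        out = [row[:] for row in kernel]  # copy so the input is not mutated
        for _ in range(turns):
            out = rot90(out)
        rotated.append(out)
    return rotated


# Four identical toy kernels, two groups: group 0 is kept, group 1 is turned once.
base = [[1, 2], [3, 4]]
result = group_wise_rotate([base, base, base, base], num_groups=2, quarter_turns=[0, 1])
```

Because each group carries its own angle, the same kernel parameters can respond to several orientations at once, which is the parameter-sharing idea the abstract points to.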
Farhad Pakdaman, Moncef Gabbouj
The emerging Learned Compression (LC) replaces the traditional codec modules with Deep Neural Networks (DNNs), which are trained end-to-end for rate-distortion performance. This approach is considered the future of image/video compression, and major efforts have been dedicated to improving its compression efficiency. However, most proposed works target compression efficiency by employing more complex DNNs, which contributes to higher computational complexity. Alternatively, this paper proposes to improve compression by fully exploiting the existing DNN capacity. To do so, the latent features are guided to learn a richer and more diverse set of features, which corresponds to better reconstruction. A channel-wise feature decorrelation loss is designed and integrated into the LC optimization. Three strategies are proposed and evaluated, which optimize (1) the transformation network, (2) the context model, and (3) both networks. Experimental results on two established LC methods show that the proposed method improves the compression with a BD-Rate of up to 8.06%, with no added complexity. The proposed solution can be applied as a plug-and-play solution to optimize any similar LC method.
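A channel-wise decorrelation penalty of the kind mentioned above could look like the following sketch. The abstract does not give the exact loss, so the formulation here (mean squared off-diagonal Pearson correlation between latent channels) and the function name `channel_decorrelation_loss` are assumptions for illustration only.

```python
import math


def channel_decorrelation_loss(latents):
    """latents: list of C channels, each a flat list of N latent values.

    Returns the mean squared Pearson correlation over all ordered pairs of
    distinct channels (assumed formulation, not taken from the paper).
    """
    n = len(latents[0])
    normed = []
    for channel in latents:
        mean = sum(channel) / n
        var = sum((v - mean) ** 2 for v in channel) / n
        std = math.sqrt(var) + 1e-8  # small constant avoids division by zero
        normed.append([(v - mean) / std for v in channel])

    total, pairs = 0.0, 0
    for i in range(len(normed)):
        for j in range(len(normed)):
            if i == j:
                continue
            corr = sum(a * b for a, b in zip(normed[i], normed[j])) / n
            total += corr ** 2
            pairs += 1
    return total / pairs if pairs else 0.0


# Redundant channels are penalised heavily, uncorrelated channels are not.
redundant = channel_decorrelation_loss([[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]])
diverse = channel_decorrelation_loss([[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]])
```

In training, such a term would be added to the rate-distortion objective with a weighting factor; that weight is a hyperparameter the abstract does not specify.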
Odin Zhang, Yufei Huang, Shichen Cheng, Mengyao Yu, Xujun Zhang, Haitao Lin, Yundian Zeng, Mingyang Wang et al.
Most earlier 3D structure-based molecular generation approaches follow an atom-wise paradigm, incrementally adding atoms to a partially built molecular fragment within protein pockets. These methods, while effective in designing tightly bound ligands, often overlook other essential properties such as synthesizability. The fragment-wise generation paradigm offers a promising solution. However, a common challenge across both atom-wise and fragment-wise methods lies in their limited ability to co-design plausible chemical and geometrical structures, resulting in distorted conformations. In response to this challenge, we introduce the Deep Geometry Handling protocol, a more abstract design that extends the design focus beyond the model architecture. Through a comprehensive review of existing geometry-related models and their protocols, we propose a novel hybrid strategy, culminating in the development of FragGen - a geometry-reliable, fragment-wise molecular generation method. FragGen marks a significant leap forward in the quality of generated geometry and the synthesis accessibility of molecules. The efficacy of FragGen is further validated by its successful application in designing type II kinase inhibitors at the nanomolar level.
Hayeon O, Chanuk Yang, Kunsoo Huh
In autonomous driving, 3D object detection provides more precise information
for downstream tasks, including path planning and motion estimation, compared
to 2D object detection. In this paper, we propose SeSame: a method aimed at
enhancing semantic information in existing LiDAR-only based 3D object
detection. This addresses the limitation of existing 3D detectors, which
primarily focus on object presence and classification, thus lacking in
capturing relationships between elemental units that constitute the data, akin
to semantic segmentation. Experiments demonstrate the effectiveness of our
method with performance improvements on the KITTI object detection benchmark.
Our code is available at https://github.com/HAMA-DL-dev/SeSame
Authors' comments: 17 pages, 4 figures
Jialin Chen, Zhiqiang Cai, Ke Xu, Di Wu, Wei Cao
Considering the noise level limit, one crucial aspect of quantum machine learning is to design a high-performing variational quantum circuit architecture with a small number of quantum gates. Like classical neural architecture search (NAS), quantum architecture search (QAS) methods employ techniques such as reinforcement learning, evolutionary algorithms, and supernet optimization to improve search efficiency. In this paper, we propose a novel qubit-wise architecture search (QWAS) method, which progressively searches a one-qubit configuration per stage and combines with the Monte Carlo Tree Search algorithm to find good quantum architectures by partitioning the search space into several good and bad subregions. The numerical experimental results indicate that our proposed method can balance the exploration and exploitation of circuit performance and size in some real-world tasks, such as MNIST, Fashion and MOSI. As far as we know, QWAS achieves state-of-the-art results on all tasks in terms of accuracy and circuit size.
Yameng Peng, Andy Song, Haytham M. Fayek, Vic Ciesielski, Xiaojun Chang
Training-free metrics (a.k.a. zero-cost proxies) are widely used to avoid
resource-intensive neural network training, especially in Neural Architecture
Search (NAS). Recent studies show that existing training-free metrics have
several limitations, such as limited correlation and poor generalisation across
different search spaces and tasks. Hence, we propose Sample-Wise Activation
Patterns and its derivative, SWAP-Score, a novel high-performance training-free
metric. It measures the expressivity of networks over a batch of input samples.
The SWAP-Score is strongly correlated with ground-truth performance across
various search spaces and tasks, outperforming 15 existing training-free
metrics on NAS-Bench-101/201/301 and TransNAS-Bench-101. The SWAP-Score can be
further enhanced by regularisation, which leads to even higher correlations in
cell-based search space and enables model size control during the search. For
example, Spearman's rank correlation coefficient between regularised SWAP-Score
and CIFAR-100 validation accuracies on NAS-Bench-201 networks is 0.90,
significantly higher than 0.80 from the second-best metric, NWOT. When
integrated with an evolutionary algorithm for NAS, our SWAP-NAS achieves
competitive performance on CIFAR-10 and ImageNet in approximately 6 minutes and
9 minutes of GPU time respectively.
Authors' comments: ICLR2024 Spotlight
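The evaluation above relies on Spearman's rank correlation between a training-free score and ground-truth accuracy. A minimal standard-library sketch of that statistic follows; the proxy scores and accuracies are made-up toy values, not results from the paper, and the helper names are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical proxy scores and validation accuracies for four architectures.
proxy_scores = [0.1, 0.4, 0.3, 0.9]
accuracies = [60.0, 72.0, 70.0, 91.0]
rho = spearman(proxy_scores, accuracies)
```

Only the ordering matters: a proxy that ranks architectures exactly as their accuracies do reaches a coefficient of 1 regardless of the scale of its scores.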
Shiwen Ni, Min Yang, Ruifeng Xu, Chengming Li, Xiping Hu
Among the various pre-trained neural language models that are popular today, dropout is already an indispensable regularization technique. To solve the inconsistency between training and inference caused by the randomness of dropout, some studies use consistency training to regularize dropout at the output layer. In this paper, we propose a novel Layer-wise Regularized Dropout (LR-Drop), which is specially designed for Transformer-based Language models. Specifically, LR-Drop layer-wise regularizes each Transformer layer using the consistency training strategy. Each training sample passes through the two siamese sub-models sampled by dropout, and then LR-Drop forces the hidden states, multi-head attention matrices, and output distribution of the two siamese sub-models to be consistent. The proposed LR-Drop can be regarded as a "self-distillation" framework, in which each sub-model generated by dropout is the other's "teacher" model and "student" model. Through extensive experiments on 8 natural language understanding datasets, 6 neural machine translation datasets, and 1 abstractive summarization dataset (a total of 15 datasets), we show that LR-Drop achieves superior performances, including state-of-the-art results.
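The layer-wise consistency objective described above can be sketched as follows. The abstract names the three quantities that are aligned (hidden states, attention matrices, output distribution) but not the distance functions, so the choices here (mean squared error for hidden states and attention, symmetric KL divergence for outputs) and the weights `w_hidden`, `w_attn`, `w_out` are assumptions.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(logits_a, logits_b):
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_consistency_loss(hidden_a, hidden_b, attn_a, attn_b,
                               logits_a, logits_b,
                               w_hidden=1.0, w_attn=1.0, w_out=1.0):
    """Each hidden_* / attn_* argument is a list with one flattened vector per
    layer, taken from two forward passes of the same sample under different
    dropout masks (the two "siamese sub-models")."""
    hidden_term = sum(mse(a, b) for a, b in zip(hidden_a, hidden_b)) / len(hidden_a)
    attn_term = sum(mse(a, b) for a, b in zip(attn_a, attn_b)) / len(attn_a)
    out_term = symmetric_kl(logits_a, logits_b)
    return w_hidden * hidden_term + w_attn * attn_term + w_out * out_term


# Toy two-layer example with hand-written activations for the two passes.
same = layerwise_consistency_loss(
    [[0.1, 0.2], [0.3, 0.4]], [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.5], [0.9, 0.1]], [[0.5, 0.5], [0.9, 0.1]],
    [2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
different = layerwise_consistency_loss(
    [[0.1, 0.2], [0.3, 0.4]], [[0.2, 0.1], [0.3, 0.5]],
    [[0.5, 0.5], [0.9, 0.1]], [[0.6, 0.4], [0.8, 0.2]],
    [2.0, 1.0, 0.0], [1.0, 2.0, 0.0])
```

This term would be added to the ordinary task loss of both passes, so each sub-model acts as teacher and student for the other.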
Haoming Li, Yusen Huo, Shuai Dou, Zhenzhe Zheng, Zhilin Zhang, Chuan Yu, Jian Xu, Fan Wu
In online advertising, advertisers participate in ad auctions to acquire ad
opportunities, often by utilizing auto-bidding tools provided by demand-side
platforms (DSPs). The current auto-bidding algorithms typically employ
reinforcement learning (RL). However, due to safety concerns, most RL-based
auto-bidding policies are trained in simulation, leading to a performance
degradation when deployed in online environments. To narrow this gap, we can
deploy multiple auto-bidding agents in parallel to collect a large interaction
dataset. Offline RL algorithms can then be utilized to train a new policy. The
trained policy can subsequently be deployed for further data collection,
resulting in an iterative training framework, which we refer to as iterative
offline RL. In this work, we identify the performance bottleneck of this
iterative offline RL framework, which originates from the ineffective
exploration and exploitation caused by the inherent conservatism of offline RL
algorithms. To overcome this bottleneck, we propose Trajectory-wise Exploration
and Exploitation (TEE), which introduces a novel data collecting and data
utilization method for iterative offline RL from a trajectory perspective.
Furthermore, to ensure the safety of online exploration while preserving the
dataset quality for TEE, we propose Safe Exploration by Adaptive Action
Selection (SEAS). Both offline experiments and real-world experiments on
Alibaba display advertising platform demonstrate the effectiveness of our
proposed method.
Authors' comments: Accepted by The Web Conference 2024 (WWW'24) as an oral paper
Kei Nakatsuru, Seiichi Uchida
Kerning is the task of setting appropriate horizontal spaces for all possible letter pairs of a certain font. One of the difficulties of kerning is that the appropriate space differs for each letter pair. Therefore, for a total of 52 capital and small letters, we need to adjust $52 \times 52 = 2704$ different spaces. Another difficulty is that there is neither a general procedure nor criterion for automatic kerning; therefore, kerning is still done manually or with heuristics. In this paper, we tackle kerning by proposing two machine-learning models, called pairwise and set-wise models. The former is a simple deep neural network that estimates the letter space for two given letter images. In contrast, the latter is a Transformer-based model and estimates the letter spaces for three or more given letter images. For example, the set-wise model simultaneously estimates 2704 spaces for 52 letter images for a certain font. Among the two models, the set-wise model is not only more efficient but also more accurate because its internal self-attention mechanism allows for more consistent kerning for all letters. Experimental results on about 2500 Google fonts and their quantitative and qualitative analyses show that the set-wise model has an average estimation error of only about 5.3 pixels when the average letter space of all fonts and letter pairs is about 115 pixels.
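The pair count and the reported error can be checked with a few lines of arithmetic. This sketch only enumerates ordered letter pairs and relates the reported 5.3-pixel error to the 115-pixel average space; the helper names are hypothetical.

```python
import string


def letter_pairs():
    """All ordered (left, right) pairs over 26 capital and 26 small letters."""
    letters = string.ascii_uppercase + string.ascii_lowercase
    return [(left, right) for left in letters for right in letters]


def relative_error(error_px, mean_space_px):
    """Estimation error as a fraction of the average letter space."""
    return error_px / mean_space_px


pairs = letter_pairs()
ratio = relative_error(5.3, 115.0)
print(len(pairs))        # number of spaces a font needs
print(f"{ratio:.1%}")    # error relative to the average space
```

Pairs are ordered because the space for "AV" is generally not the space for "VA", which is why the count is 52 squared rather than the number of unordered pairs.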
Song Guo, Fan Wu, Lei Zhang, Xiawu Zheng, Shengchuan Zhang, Fei Chao, Yiyu Shi, Rongrong Ji
Existing methods for fine-tuning sparse LLMs often suffer from resource-intensive requirements and high retraining costs. Additionally, many fine-tuning methods rely on approximations or heuristic optimization strategies, which may lead to suboptimal solutions. To address these issues, we propose an efficient and fast framework for fine-tuning sparse LLMs based on minimizing reconstruction error. Our approach involves sampling a small dataset for calibration and utilizing backpropagation to iteratively optimize the block-wise reconstruction error, one block at a time, aiming for optimal solutions. Extensive experiments on various benchmarks consistently demonstrate the superiority of our method over other baselines. For instance, on the Wikitext2 dataset with LlamaV1-7B at 70% sparsity, our proposed EBFT achieves a perplexity of 16.88, compared with a perplexity of 75.14 for the state-of-the-art DSnoT. Moreover, with a structured sparsity ratio of 26%, EBFT achieves a perplexity of 16.27, outperforming LoRA (perplexity 16.44). Furthermore, the fine-tuning process of EBFT for LlamaV1-7B only takes approximately 30 minutes, and the entire framework can be executed on a single 16GB GPU. The source code is available at https://github.com/sunggo/EBFT.
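The idea of minimizing reconstruction error under a fixed sparsity mask can be sketched on a single linear layer. This is not the EBFT implementation: treating one linear layer as the "block", using plain gradient descent, and the names `finetune_block` and `recon_error` are all assumptions made for illustration.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def recon_error(x, weights, target):
    """Mean (over calibration samples) squared error against the dense output."""
    pred = matmul(x, weights)
    return sum((p - t) ** 2 for pr, tr in zip(pred, target)
               for p, t in zip(pr, tr)) / len(x)


def finetune_block(x, w_dense, mask, lr=0.05, steps=200):
    """Update only the unpruned weights so the sparse block reproduces the
    dense block's output on the calibration inputs `x`."""
    target = matmul(x, w_dense)
    w = [[wv * m for wv, m in zip(w_row, m_row)]
         for w_row, m_row in zip(w_dense, mask)]
    n = len(x)
    for _ in range(steps):
        pred = matmul(x, w)
        resid = [[p - t for p, t in zip(pr, tr)] for pr, tr in zip(pred, target)]
        for i in range(len(w)):
            for j in range(len(w[0])):
                if mask[i][j]:  # pruned entries stay exactly zero
                    grad = 2.0 / n * sum(x[k][i] * resid[k][j] for k in range(n))
                    w[i][j] -= lr * grad
    return w


# Toy calibration set: 3 samples, 2 inputs, 1 output; the second weight is pruned.
x = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]]
w_dense = [[1.0], [0.5]]
mask = [[1], [0]]
target = matmul(x, w_dense)
before = recon_error(x, [[1.0], [0.0]], target)
w_tuned = finetune_block(x, w_dense, mask)
after = recon_error(x, w_tuned, target)
```

The surviving weight moves away from its dense value to compensate for the pruned one, which lowers the reconstruction error without touching the sparsity pattern.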
Zouying Cao, Yifei Yang, Hai Zhao
Large Language Models (LLMs) suffer from a huge number of parameters, which
restricts their deployment on edge devices. Weight sharing is one promising
solution that encourages weight reuse, effectively reducing memory usage with
less performance drop. However, current weight sharing techniques primarily
focus on small-scale models like BERT and employ coarse-grained sharing rules,
e.g., layer-wise. This becomes limiting given the prevalence of LLMs, and
sharing an entire layer or block obviously diminishes the flexibility of weight
sharing. In this paper, we present a perspective on head-wise shareable
attention for large language models. We further propose two memory-efficient
methods that share parameters across attention heads, with a specific focus on
LLMs. Both of them use the same dynamic strategy to select the shared weight
matrices. The first method directly reuses the pre-trained weights without
retraining, denoted as DirectShare. The second method first
post-trains with a constraint on weight-matrix similarity and then shares,
denoted as PostShare. Experimental results reveal our head-wise
shared models still maintain satisfactory capabilities, demonstrating the
feasibility of fine-grained weight sharing applied to LLMs.
Authors' comments: 17 pages, 7 figures, 21 tables, EMNLP'24 Findings
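Head-wise sharing without retraining can be sketched as a similarity-based assignment. The abstract does not state the selection criterion, so cosine similarity between flattened head weights, the greedy matching order, the threshold, and the name `share_heads` are all assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def share_heads(heads, threshold=0.9):
    """heads: one flattened weight vector per attention head.

    Returns `owner`, where owner[i] is the index of the head whose weights
    head i will use. A head reuses the most similar earlier head that still
    keeps its own weights, provided the similarity reaches `threshold`.
    """
    owner = list(range(len(heads)))
    for i in range(1, len(heads)):
        best_j, best_sim = None, threshold
        for j in range(i):
            if owner[j] != j:
                continue  # only borrow from heads that store their own weights
            sim = cosine(heads[i], heads[j])
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            owner[i] = best_j
    return owner


# Toy weights: head 1 is nearly parallel to head 0, head 2 is orthogonal.
toy_heads = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
owner = share_heads(toy_heads)
stored = len(set(owner))  # number of weight matrices that must be kept
```

A post-training variant would first fine-tune with a penalty that pulls selected heads together and then apply the same assignment, so that sharing costs less accuracy.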
Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Knowledge editing aims to rectify inaccuracies in large language models (LLMs) without costly retraining for outdated or erroneous knowledge. However, current knowledge editing methods primarily focus on single editing, failing to meet the requirements for lifelong editing. In this paper, lifelong editing is synonymous with lifelong knowledge editing. This study reveals a performance degradation encountered by knowledge editing in lifelong editing, characterized by toxicity buildup and toxicity flash, with the primary cause identified as pattern unmatch. We introduce a knowledge editing approach named WilKE, which selects the editing layer based on the pattern matching degree of editing knowledge across different layers. Experimental results demonstrate that, in lifelong editing, WilKE exhibits an average improvement of 46.2% and 67.8% on editing GPT2-XL and GPT-J, respectively, relative to state-of-the-art knowledge editing methods.
Xinjian Zhao, Liang Zhang, Yang Liu, Ruocheng Guo, Xiangyu Zhao
Graph contrastive learning (GCL) has emerged as a pivotal technique in the domain of graph representation learning. A crucial aspect of effective GCL is the caliber of generated positive and negative samples, which is intrinsically dictated by their resemblance to the original data. Nevertheless, precise control over similarity during sample generation presents a formidable challenge, often impeding the effective discovery of representative graph patterns. To address this challenge, we propose an innovative framework: Adversarial Curriculum Graph Contrastive Learning (ACGCL), which capitalizes on the merits of pair-wise augmentation to engender graph-level positive and negative samples with controllable similarity, alongside subgraph contrastive learning to discern effective graph patterns therein. Within the ACGCL framework, we have devised a novel adversarial curriculum training methodology that facilitates progressive learning by sequentially increasing the difficulty of distinguishing the generated samples. Notably, this approach transcends the prevalent sparsity issue inherent in conventional curriculum learning strategies by adaptively concentrating on more challenging training data. Finally, a comprehensive assessment of ACGCL is conducted through extensive experiments on six well-known benchmark datasets, wherein ACGCL conspicuously surpasses a set of state-of-the-art baselines.
Bram Vanherle, Vittorio Pippi, Silvia Cascianelli, Nick Michiels, Frank Van Reeth, Rita Cucchiara
Styled Handwritten Text Generation (HTG) has received significant attention in recent years, propelled by the success of learning-based solutions employing GANs, Transformers, and, preliminarily, Diffusion Models. Despite this surge in interest, there remains a critical yet understudied aspect - the impact of the input, both visual and textual, on the HTG model training and its subsequent influence on performance. This study delves deeper into a cutting-edge Styled-HTG approach, proposing strategies for input preparation and training regularization that allow the model to achieve better performance and generalize better. These aspects are validated through extensive analysis on several different settings and datasets. Moreover, in this work, we go beyond performance optimization and address a significant hurdle in HTG research - the lack of a standardized evaluation protocol. In particular, we propose a standardization of the evaluation protocol for HTG and conduct a comprehensive benchmarking of existing approaches. By doing so, we aim to establish a foundation for fair and meaningful comparisons between HTG strategies, fostering progress in the field.
Artem Chernikov, Henry Towsner
We investigate various forms of (model-theoretic) stability for hypergraphs
and their corresponding strengthenings of the hypergraph regularity lemma with
respect to partitions of vertices. On the one hand, we provide a complete
classification of the various possibilities in the ternary case. On the other
hand, we provide an example of a family of slice-wise stable 3-hypergraphs so
that for no partition of the vertices, any triple of parts has density close to
0 or 1. In particular, this addresses some questions and conjectures of Terry
and Wolf. We work in the general measure theoretic context of graded
probability spaces, so all our results apply both to measures in ultraproducts
of finite graphs, leading to the aforementioned combinatorial applications, and
to commuting definable Keisler measures, leading to applications in model
theory.
Authors' comments: 67 pages
Xiaofeng Liu, Nadya Shusharina, Helen A Shih, C. -C. Jay Kuo, Georges El Fakhri, Jonghye Woo
In this work, we aim to predict the survival time (ST) of glioblastoma (GBM)
patients undergoing different treatments based on preoperative magnetic
resonance (MR) scans. Personalized and precise treatment planning can be
achieved by comparing the ST of different treatments. It is well established
that both the current status of the patient (as represented by the MR scans)
and the choice of treatment are the cause of ST. While previous related
MR-based glioblastoma ST studies have focused only on the direct mapping of MR
scans to ST, they have not included the underlying causal relationship between
treatments and ST. To address this limitation, we propose a
treatment-conditioned regression model for glioblastoma ST that incorporates
treatment information in addition to MR scans. Our approach allows us to
effectively utilize the data from all of the treatments in a unified manner,
rather than having to train separate models for each of the treatments.
Furthermore, treatment can be effectively injected into each convolutional
layer through the adaptive instance normalization we employ. We evaluate our
framework on the BraTS20 ST prediction task. Three treatment options are
considered: Gross Total Resection (GTR), Subtotal Resection (STR), and no
resection. The evaluation results demonstrate the effectiveness of injecting
the treatment for estimating GBM survival.
Authors' comments: SPIE Medical Imaging 2024: Computer-Aided Diagnosis
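Injecting a treatment through adaptive instance normalization can be sketched as follows. The normalization itself is the standard AdaIN operation; the per-treatment scale and shift values in `TREATMENT_STYLE` are hypothetical placeholders for what a trained model would produce from the treatment label.

```python
import math


def adaptive_instance_norm(feature_map, gamma, beta, eps=1e-5):
    """feature_map: list of channels, each a flat list of spatial values.

    Each channel is normalised to zero mean and unit variance, then rescaled
    by gamma[c] and shifted by beta[c], which carry the treatment information.
    """
    out = []
    for channel, g, b in zip(feature_map, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        scale = g / math.sqrt(var + eps)
        out.append([scale * (v - mean) + b for v in channel])
    return out


# Hypothetical (gamma, beta) pairs; in a trained model these would come from
# a learned mapping of the treatment label, not from a hand-written table.
TREATMENT_STYLE = {
    "GTR": ([1.5, 0.5], [0.2, -0.1]),
    "STR": ([1.0, 1.0], [0.0, 0.0]),
    "none": ([0.5, 1.5], [-0.2, 0.1]),
}

features = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0]]
gamma, beta = TREATMENT_STYLE["GTR"]
conditioned = adaptive_instance_norm(features, gamma, beta)
```

Because only the scale and shift depend on the treatment, one set of convolutional weights serves all three treatment options, which matches the unified-model motivation in the abstract.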
Reduan Achtibat, Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Aakriti Jain, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Large Language Models are prone to biased predictions and hallucinations, underlining the paramount importance of understanding their model-internal reasoning process. However, achieving faithful attributions for the entirety of a black-box transformer model while maintaining computational efficiency is an unsolved challenge. By extending the Layer-wise Relevance Propagation attribution method to handle attention layers, we address these challenges effectively. While partial solutions exist, our method is the first to faithfully and holistically attribute not only input but also latent representations of transformer models with computational efficiency similar to a single backward pass. Through extensive evaluations against existing methods on Llama 2, Flan-T5 and the Vision Transformer architecture, we demonstrate that our proposed approach surpasses alternative methods in terms of faithfulness and enables the understanding of latent representations, opening up the door for concept-based explanations. We provide an open-source implementation on GitHub https://github.com/rachtibat/LRP-for-Transformers.
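For orientation, the basic Layer-wise Relevance Propagation step that the work above builds on can be sketched for one bias-free linear layer using the standard epsilon rule. This is the classical rule only; the attention-layer extension that the abstract contributes is not reproduced here, and the function name is hypothetical.

```python
def lrp_epsilon_linear(activations, weights, relevance_out, eps=1e-9):
    """Redistribute output relevance to the inputs of z_j = sum_i a_i * w_ij.

    Input i receives the share a_i * w_ij / z_j of each output's relevance;
    eps stabilises the division when z_j is close to zero.
    """
    n_in, n_out = len(activations), len(relevance_out)
    z = [sum(activations[i] * weights[i][j] for i in range(n_in))
         for j in range(n_out)]
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n_in):
            relevance_in[i] += activations[i] * weights[i][j] / denom * relevance_out[j]
    return relevance_in


# Toy layer with two inputs and two outputs, each output carrying relevance 1.
a = [1.0, 2.0]
w = [[0.5, -1.0], [0.25, 1.0]]
r_in = lrp_epsilon_linear(a, w, [1.0, 1.0])
```

The relevance entering the layer equals the relevance leaving it (up to the epsilon term), and this conservation property is what makes a layer-by-layer backward pass yield a complete attribution.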
Karim Helwani, Masahito Togami, Paris Smaragdis, Michael M. Goodwin
While neural network approaches have made significant strides in resolving classical signal processing problems, it is often the case that hybrid approaches that draw insight from both signal processing and neural networks produce more complete solutions. In this paper, we present a hybrid classical digital signal processing/deep neural network (DSP/DNN) approach to source separation (SS), highlighting the theoretical link between variational autoencoders and classical approaches to SS. We propose a system that transforms the single-channel under-determined SS task to an equivalent multichannel over-determined SS problem in a properly designed latent space. The separation task in the latent space is treated as finding a variational block-wise disentangled representation of the mixture. We show empirically that the design choices and the variational formulation of the task at hand, motivated by classical signal processing theoretical results, lead to robustness to unseen out-of-distribution data and a reduction of the overfitting risk. To address the resulting permutation issue, we explicitly incorporate a novel differentiable permutation loss function and augment the model with a memory mechanism to keep track of the statistics of the individual sources.
Xunkai Li, Jingyuan Ma, Zhengyu Wu, Daohan Su, Wentao Zhang, Rong-Hua Li, Guoren Wang
Scalable graph neural networks (GNNs) have emerged as a promising technique,
which exhibits superior predictive performance and high running efficiency
across numerous large-scale graph-based web applications. However, (i) most
scalable GNNs tend to treat all nodes in graphs with the same propagation
rules, neglecting their topological uniqueness; (ii) existing node-wise
propagation optimization strategies are insufficient on web-scale graphs with
intricate topology, where a full portrayal of nodes' local properties is
required. Intuitively, different nodes in web-scale graphs possess distinct
topological roles, and therefore propagating them indiscriminately or neglecting
local contexts may compromise the quality of node representations. This
intricate topology in web-scale graphs cannot be matched by small-scale
scenarios. To address the above issues, we propose Adaptive
Topology-aware Propagation (ATP), which reduces potential
high-bias propagation and extracts structural patterns of each node in a
scalable manner to improve running efficiency and predictive performance.
Remarkably, ATP is crafted to be a plug-and-play node-wise propagation
optimization strategy, allowing for offline execution independent of the graph
learning process in a new perspective. Therefore, this approach can be
seamlessly integrated into most scalable GNNs while remaining orthogonal to
existing node-wise propagation optimization strategies. Extensive experiments
on 12 datasets, including the most representative large-scale ogbn-papers100M,
have demonstrated the effectiveness of ATP. Specifically, ATP has proven to be
efficient in improving the performance of prevalent scalable GNNs for
semi-supervised node classification while addressing redundant computational
costs.
Authors' comments: Accepted by WWW 2024
Dongxia Wu, Tsuyoshi Idé, Aurélie Lozano, Georgios Kollias, Jiří Navrátil, Naoki Abe, Yi-An Ma, Rose Yu
We address the problem of learning Granger causality from asynchronous, interdependent, multi-type event sequences. In particular, we are interested in discovering instance-level causal structures in an unsupervised manner. Instance-level causality identifies causal relationships among individual events, providing more fine-grained information for decision-making. Existing work in the literature either requires strong assumptions, such as linearity in the intensity function, or heuristically defined model parameters that do not necessarily meet the requirements of Granger causality. We propose Instance-wise Self-Attentive Hawkes Processes (ISAHP), a novel deep learning framework that can directly infer the Granger causality at the event instance level. ISAHP is the first neural point process model that meets the requirements of Granger causality. It leverages the self-attention mechanism of the transformer to align with the principles of Granger causality. We empirically demonstrate that ISAHP is capable of discovering complex instance-level causal structures that cannot be handled by classical models. We also show that ISAHP achieves state-of-the-art performance in proxy tasks involving type-level causal discovery and instance-level event type prediction.
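As background for the instance-level view above, the sketch below evaluates the intensity of a classical single-type Hawkes process with an exponential kernel and keeps the contribution of each past event separate. This is the textbook parametric model, not ISAHP's neural formulation, and the parameter values and function name are assumptions.

```python
import math


def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Intensity at time t given past event times `history`.

    lambda(t) = mu + sum over t_i < t of alpha * exp(-beta * (t - t_i)).
    The per-event terms are returned as well, because they show how much
    each individual past event raises the current rate.
    """
    contributions = [alpha * math.exp(-beta * (t - t_i))
                     for t_i in history if t_i < t]
    return mu + sum(contributions), contributions


# Two past events at times 0 and 1; intensity queried at time 2.
intensity, per_event = hawkes_intensity(2.0, [0.0, 1.0])
```

A past event whose contribution is identically zero cannot help predict future events, which is the sense in which the excitation terms are read as Granger-causal influence.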