Xiaoyu Kong, Jiancan Wu, An Zhang, Leheng Sheng, Hui Lin, Xiang Wang, Xiangnan He
Sequential recommendation systems predict the next interaction item based on users' past interactions, aligning recommendations with individual preferences. Leveraging the strengths of Large Language Models (LLMs) in knowledge comprehension and reasoning, recent approaches apply LLMs to sequential recommendation. A common paradigm is converting user behavior sequences into instruction data, and fine-tuning the LLM with parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA). However, the uniform application of LoRA across diverse user behaviors is insufficient to capture individual variability, resulting in negative transfer between disparate sequences. To address these challenges, we propose Instance-wise LoRA (iLoRA). We treat the sequential recommendation task as a form of multi-task learning, integrating LoRA with the Mixture of Experts (MoE) framework. This approach encourages different experts to capture various aspects of user behavior. Additionally, we introduce a sequence representation guided gate function that generates customized expert participation weights for each user sequence, which allows dynamic parameter adjustment for instance-wise recommendations. In sequential recommendation, iLoRA achieves an average relative improvement of 11.4\% over basic LoRA in the hit ratio metric, with less than a 1\% relative increase in trainable parameters. Extensive experiments on three benchmark datasets demonstrate the effectiveness of iLoRA, highlighting its superior performance compared to existing methods in mitigating negative transfer and improving recommendation accuracy. Our data and code are available at https://github.com/AkaliKong/iLoRA.
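The gating idea in the abstract above can be illustrated with a toy layer. This is a minimal NumPy sketch, not the authors' implementation: the class name `InstanceWiseLoRA`, the softmax gate, the shapes, and the initialisation are all assumptions made for illustration. Each expert is a low-rank pair (A, B), and a gate computed from a sequence representation mixes the experts into one instance-specific update of a frozen weight.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class InstanceWiseLoRA:
    """Toy mixture of LoRA experts on top of a frozen weight (illustrative only)."""

    def __init__(self, d_in, d_out, rank, n_experts, d_seq, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_in, d_out))                    # frozen base weight
        self.A = rng.normal(size=(n_experts, d_in, rank)) * 0.01   # per-expert down-projection
        self.B = np.zeros((n_experts, rank, d_out))                # per-expert up-projection, zero-initialised
        self.G = rng.normal(size=(d_seq, n_experts))               # gate projection (hypothetical)

    def gate(self, seq_repr):
        # One participation weight per expert, derived from the sequence representation.
        return softmax(seq_repr @ self.G)

    def forward(self, x, seq_repr):
        w = self.gate(seq_repr)
        # Instance-specific low-rank update: sum_e w_e * A_e @ B_e.
        delta = np.einsum("e,eir,ero->io", w, self.A, self.B)
        return x @ (self.W + delta)

layer = InstanceWiseLoRA(d_in=4, d_out=3, rank=2, n_experts=3, d_seq=5)
x = np.ones(4)
seq_repr = np.arange(5.0)
y = layer.forward(x, seq_repr)
```

Because the up-projections start at zero, the layer initially reproduces the frozen model; different sequence representations only produce different updates once the experts have been trained.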
Yabin Wang, Zhiwu Huang, Su Zhou, Adam Prugel-Bennett, Xiaopeng Hong
The diffusion of deepfake technologies has sparked serious concerns about their potential misuse across various domains, prompting the urgent need for robust detection methods. Despite recent advances, many current approaches prioritize short-term gains at the expense of long-term effectiveness. This paper critiques the overly specialized approach of fine-tuning pre-trained models solely with a penny-wise objective on a single deepfake dataset, while disregarding the pound-wise balance for generalization and knowledge retention. To address this "Penny-Wise and Pound-Foolish" issue, we propose a novel learning framework (PoundNet) for generalization of deepfake detection on a pre-trained vision-language model. PoundNet incorporates a learnable prompt design and a balanced objective to preserve broad knowledge from upstream tasks (object classification) while enhancing generalization for downstream tasks (deepfake detection). We train PoundNet on a standard single deepfake dataset, following common practice in the literature. We then evaluate its performance across 10 public large-scale deepfake datasets with 5 main evaluation metrics, forming the largest benchmark test set for assessing the generalization ability of deepfake detection models, to our knowledge. The comprehensive benchmark evaluation demonstrates that the proposed PoundNet is significantly less "Penny-Wise and Pound-Foolish", achieving a remarkable improvement of 19% in deepfake detection performance compared to state-of-the-art methods, while maintaining a strong performance of 63% on object classification tasks, where other deepfake detection models tend to be ineffective. Code and data are open-sourced at https://github.com/iamwangyabin/PoundNet.
Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li et al.
This paper identifies significant redundancy in the query-key interactions
within self-attention mechanisms of diffusion transformer models, particularly
during the early stages of denoising diffusion steps. In response to this
observation, we present a novel diffusion transformer framework incorporating
an additional set of mediator tokens to engage with queries and keys
separately. By modulating the number of mediator tokens during the denoising
generation phases, our model initiates the denoising process with a precise,
non-ambiguous stage and gradually transitions to a phase enriched with detail.
Concurrently, integrating mediator tokens simplifies the attention module's
complexity to a linear scale, enhancing the efficiency of global attention
processes. Additionally, we propose a time-step dynamic mediator token
adjustment mechanism that further decreases the required computational FLOPs
for generation, simultaneously facilitating the generation of high-quality
images within the constraints of varied inference budgets. Extensive
experiments demonstrate that the proposed method can improve the generated
image quality while also reducing the inference cost of diffusion transformers.
When integrated with the recent work SiT, our method achieves a
state-of-the-art FID score of 2.01. The source code is available at
https://github.com/LeapLabTHU/Attention-Mediators.
Authors' comments: ECCV 2024
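To make the linear-complexity claim in the abstract above concrete, here is a minimal NumPy sketch of attention routed through a small set of mediator tokens. It is an illustration under assumptions, not the released code: the function name `mediator_attention`, the two-softmax form, and the way the mediators `M` are obtained (random here) are hypothetical. With n tokens and m mediators, the cost is O(n·m) rather than O(n²), since queries and keys never interact directly.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mediator_attention(Q, K, V, M):
    """Attention through mediator tokens (illustrative sketch).

    Q, K: (n, d) queries and keys; V: (n, d_v) values; M: (m, d) mediators.
    """
    d = Q.shape[-1]
    # Step 1: each mediator aggregates the values by attending to the keys, (m, n) @ (n, d_v).
    mediated = softmax(M @ K.T / np.sqrt(d)) @ V
    # Step 2: each query reads from the m mediators, (n, m) @ (m, d_v).
    return softmax(Q @ M.T / np.sqrt(d)) @ mediated

rng = np.random.default_rng(0)
Q = rng.normal(size=(16, 8))
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 5))
M = rng.normal(size=(4, 8))   # 4 mediator tokens; the count could vary per denoising step
out = mediator_attention(Q, K, V, M)
```

Changing the number of rows in `M` is the only knob needed to trade compute for detail in this sketch, which is the kind of adjustment the time-step dynamic mechanism in the abstract refers to.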
David Zagardo
Traditional Differentially Private Stochastic Gradient Descent (DP-SGD)
introduces statistical noise on top of gradients drawn from a Gaussian
distribution to ensure privacy. This paper introduces the novel Differentially
Private Block-wise Gradient Shuffle (DP-BloGS) algorithm for deep learning.
BloGS builds off of existing private deep learning literature, but makes a
definitive shift by taking a probabilistic approach to gradient noise
introduction through shuffling modeled after information theoretic privacy
analyses. The theoretical results presented in this paper show that the
combination of shuffling, parameter-specific block size selection, batch layer
clipping, and gradient accumulation allows DP-BloGS to achieve training times
close to that of non-private training while maintaining similar privacy and
utility guarantees to DP-SGD. DP-BloGS is found to be significantly more
resistant to data extraction attempts than DP-SGD. The theoretical results are
validated by the experimental findings.
Authors' comments: 43 pages, 11 figures, 8 tables
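For reference, the DP-SGD baseline that the abstract above contrasts with can be sketched in a few lines: clip each per-example gradient, sum, add Gaussian noise scaled to the clipping norm, and average. This is a sketch of the standard baseline only, not of DP-BloGS; the function name and parameter names are illustrative, and no privacy accounting is performed.

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """One noisy gradient estimate in the style of DP-SGD (illustrative sketch).

    per_example_grads: (batch, n_params) array of flattened gradients.
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale each example so that its gradient norm is at most clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```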
Aadarsh Singh
In this paper, we have explored the impact of certain index-dependent
element-wise transformations on the null space of a matrix. We have found the
conditions on this transformation that will preserve the rank and nullity of
the original matrix. We have also found some transformations which give
localized null vectors for the transformed matrix. Finally, some possible
applications of these localized null vectors and eigenvalues are mentioned in
different domains.
Authors' comments: 13 pages
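One simple family of index-dependent element-wise transformations that preserves rank and nullity is a_ij -> r_i * c_j * a_ij with all r_i and c_j nonzero, since it equals diag(r) A diag(c). The sketch below checks this numerically. It is offered as an illustrative example only; whether this family coincides with the conditions derived in the paper is an assumption, and the function names are hypothetical.

```python
import numpy as np

def scale_entries(A, r, c):
    """Apply a_ij -> r_i * c_j * a_ij (requires every r_i and c_j to be nonzero)."""
    return A * np.outer(r, c)

def transformed_null_vector(x, c):
    """If A @ x = 0, then scale_entries(A, r, c) @ (x / c) = 0."""
    return x / c

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])      # rank 2
x = np.array([1.0, 1.0, -1.0])       # null vector of A
r = np.array([2.0, -1.0, 0.5])
c = np.array([1.0, 3.0, -2.0])

B = scale_entries(A, r, c)
y = transformed_null_vector(x, c)
```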
Ruizi Han, Jinglei Tang
Parameter-efficient transfer learning (PETL) aims to adapt large pre-trained
models using limited parameters. While most PETL approaches update the added
parameters and freeze pre-trained weights during training, the minimal impact
of task-specific deep layers on cross-domain data poses a challenge as PETL
cannot modify them, resulting in redundant model structures. Structural pruning
effectively reduces model redundancy; however, common pruning methods often
lead to an excessive increase in stored parameters due to varying pruning
structures based on pruning rates and data. Recognizing the storage parameter
volume issue, we propose a Straightforward layer-wise pruning method, called
SLS, for pruning PETL-transferred models. By evaluating parameters from a
feature perspective of each layer and utilizing clustering metrics to assess
current parameters based on clustering phenomena in low-dimensional space
obtained through t-SNE, SLS facilitates informed pruning decisions. Our study
reveals that layer-wise pruning, with a focus on storing pruning indices,
addresses storage volume concerns. Notably, mainstream layer-wise pruning
methods may not be suitable for assessing layer importance in PETL-transferred
models, where the majority of parameters are pre-trained and have limited
relevance to downstream datasets. Comparative analysis against state-of-the-art
PETL methods demonstrates that the pruned model achieved a notable balance
between model throughput and accuracy. Moreover, SLS effectively reduces
storage overhead arising from varying pruned structures while enhancing the
accuracy and speed of pruned models compared to conventional pruning methods.
Authors' comments: published to ECCV2024
Jiahong Ma, Mingguo He, Zhewei Wei
Spectral Graph Neural Networks have demonstrated superior performance in
graph representation learning. However, many current methods focus on employing
shared polynomial coefficients for all nodes, i.e., learning node-unified
filters, which limits the filters' flexibility for node-level tasks. The recent
DSF attempts to overcome this limitation by learning node-wise coefficients
based on positional encoding. However, the initialization and updating process
of the positional encoding are burdensome, hindering scalability on large-scale
graphs. In this work, we propose a scalable node-wise filter, PolyAttn.
Leveraging the attention mechanism, PolyAttn can directly learn node-wise
filters in an efficient manner, offering powerful representation capabilities.
Building on PolyAttn, we introduce the whole model, named PolyFormer. Through the
lens of Graph Transformer models, PolyFormer, which calculates attention scores
within nodes, shows great scalability. Moreover, the model captures spectral
information, enhancing expressiveness while maintaining efficiency. With these
advantages, PolyFormer offers a desirable balance between scalability and
expressiveness for node-level tasks. Extensive experiments demonstrate that our
proposed methods excel at learning arbitrary node-wise filters, showing
superior performance on both homophilic and heterophilic graphs, and handling
graphs containing up to 100 million nodes. The code is available at
https://github.com/air029/PolyFormer.
Authors' comments: ACM SIGKDD 2024
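The difference between node-unified and node-wise polynomial filters in the abstract above can be sketched as follows. This is a minimal NumPy illustration under assumptions: it uses a plain monomial basis P^k X and takes the per-node coefficients as given, whereas the paper learns them with attention; the function names are hypothetical.

```python
import numpy as np

def polynomial_tokens(P, X, K):
    """Stack [X, P X, ..., P^K X] per node, giving shape (n_nodes, K + 1, d)."""
    tokens = [X]
    for _ in range(K):
        tokens.append(P @ tokens[-1])
    return np.stack(tokens, axis=1)

def node_wise_filter(tokens, coeffs):
    """Combine each node's tokens with that node's own coefficients.

    coeffs: (n_nodes, K + 1). Identical rows recover a node-unified filter.
    """
    return np.einsum("nk,nkd->nd", coeffs, tokens)

# Path graph on 3 nodes with a scaled adjacency matrix as the propagation operator.
P = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]]) / 2.0
X = np.arange(6.0).reshape(3, 2)
tokens = polynomial_tokens(P, X, K=2)

shared = np.array([0.5, 0.3, 0.2])
unified = node_wise_filter(tokens, np.tile(shared, (3, 1)))
per_node = node_wise_filter(tokens, np.array([[1.0, 0.0, 0.0],
                                              [0.0, 1.0, 0.0],
                                              [0.0, 0.0, 1.0]]))
```

Since the tokens are computed once and the combination is local to each node, the per-node step involves only K + 1 tokens per node, which is consistent with the scalability argument in the abstract.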
Luís Almeida, Inês Dutra, Francesco Renna
Semantic segmentation is a fundamental computer vision task with a vast number of applications. State-of-the-art methods increasingly rely on deep learning models, which are known to estimate uncertainty incorrectly and to be overconfident in their predictions, especially on data not seen during training. This is particularly problematic in semantic segmentation due to inherent class imbalance. Popular uncertainty quantification approaches are task-agnostic and fail to leverage spatial pixel correlations in uncertainty estimates, which are crucial in this task. In this work, a novel training methodology specifically designed for semantic segmentation is presented. Training samples are weighted by instance-wise uncertainty masks computed by an ensemble. This is shown to increase performance on minority classes and to boost model generalization and robustness to domain shift when compared to using the inverse of class proportions or no class weights at all. This method addresses the challenges of class imbalance and uncertainty estimation in semantic segmentation, potentially enhancing model performance and reliability across various applications.
Seitaro Otsuki, Tsumugi Iida, Félix Doublet, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura
The transparent formulation of explanation methods is essential for
elucidating the predictions of neural networks, which are typically black-box
models. Layer-wise Relevance Propagation (LRP) is a well-established method
that transparently traces the flow of a model's prediction backward through its
architecture by backpropagating relevance scores. However, the conventional LRP
does not fully consider the existence of skip connections, and thus its
application to the widely used ResNet architecture has not been thoroughly
explored. In this study, we extend LRP to ResNet models by introducing
Relevance Splitting at points where the output from a skip connection converges
with that from a residual block. Our formulation guarantees the conservation
property throughout the process, thereby preserving the integrity of the
generated explanations. To evaluate the effectiveness of our approach, we
conduct experiments on ImageNet and the Caltech-UCSD Birds-200-2011 dataset.
Our method achieves superior performance to that of baseline methods on
standard evaluation metrics such as the Insertion-Deletion score while
maintaining its conservation property. We will release our code for further
research at https://5ei74r0.github.io/lrp-for-resnet.page/
Authors' comments: Accepted for presentation at ECCV2024
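The conservation property mentioned in the abstract above can be illustrated at a single residual addition. The sketch below splits the incoming relevance in proportion to each branch's contribution to the sum. This proportional rule is an assumption chosen for illustration and may differ from the paper's Relevance Splitting formulation; the function name and the stabiliser `eps` are hypothetical.

```python
import numpy as np

def split_relevance(R_out, z_skip, z_res, eps=1e-9):
    """Split relevance at out = z_skip + z_res between the two branches.

    Returns (R_skip, R_res) with R_skip + R_res == R_out element-wise.
    """
    total = z_skip + z_res
    # Stabilise the denominator while keeping its sign.
    denom = np.where(total >= 0, total + eps, total - eps)
    R_skip = R_out * z_skip / denom
    # Assign the remainder to the residual branch so that conservation holds.
    R_res = R_out - R_skip
    return R_skip, R_res

R_out = np.array([1.0, 2.0, -0.5])
z_skip = np.array([0.2, 1.0, 0.3])
z_res = np.array([0.6, -0.5, 0.1])
R_skip, R_res = split_relevance(R_out, z_skip, z_res)
```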
Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao
Quantizing the activations of large language models (LLMs) has been a significant challenge due to the presence of structured outliers. Most existing methods focus on the per-token or per-tensor quantization of activations, making it difficult to achieve both accuracy and hardware efficiency. To address this problem, we propose OutlierTune, an efficient per-channel post-training quantization (PTQ) method for the activations of LLMs. OutlierTune consists of two components: pre-execution of dequantization and symmetrization. The pre-execution of dequantization updates the model weights by the activation scaling factors, avoiding the internal scaling and the costly additional computational overhead introduced by per-channel activation quantization. The symmetrization further reduces the quantization differences arising from the weight updates by ensuring balanced numerical ranges across different activation channels. OutlierTune is easy to implement and hardware-efficient, introducing almost no additional computational overhead during inference. Extensive experiments show that the proposed framework outperforms existing methods across multiple different tasks. Demonstrating better generalization, this framework improves the Int6 quantization of instruction-tuned LLMs, such as OPT-IML, to the same level as half-precision (FP16). Moreover, we show that the proposed framework is 1.48x faster than the FP16 implementation while reducing memory usage by approximately 2x.
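The "pre-execution of dequantization" step described above rests on a simple identity: if activations are quantized with one scale per input channel, those scales can be multiplied into the matching rows of the weight matrix ahead of time, so no per-channel rescaling is needed inside the matrix product. The NumPy sketch below illustrates only that identity. It is not the OutlierTune implementation, it omits symmetrization, it computes scales from the same batch rather than from calibration data, and all function names are hypothetical.

```python
import numpy as np

def per_channel_scales(X, n_bits=8):
    """One symmetric scale per activation channel (column of X)."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.maximum(np.abs(X).max(axis=0), 1e-12) / qmax

def quantize_activations(X, scales, n_bits=8):
    """Round each channel to signed integers using its own scale."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(X / scales), -qmax - 1, qmax)

def fold_scales_into_weights(W, scales):
    """Absorb the dequantization into W: row j of W is multiplied by scale j."""
    return scales[:, None] * W

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 6))
X[:, 3] *= 50.0                      # one outlier channel
W = rng.normal(size=(6, 4))

s = per_channel_scales(X)
Xq = quantize_activations(X, s)
Wf = fold_scales_into_weights(W, s)
approx = Xq @ Wf                     # integer activations times pre-scaled weights
exact = X @ W
```

With a single per-tensor scale, the outlier channel would dictate the step size for every channel; with per-channel scales the error in each channel is bounded by half of that channel's own step.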
Kevin Gu, Eva Tuecke, Dmitriy Katz, Raya Horesh, David Alvarez-Melis, Mikhail Yurochkin
Large language models (LLMs) have shown remarkable potential for problem
solving, with open source models achieving increasingly impressive performance
on benchmarks measuring areas from logical reasoning to mathematical ability.
Ensembling models can further improve capabilities across a variety of domains.
However, conventional methods of combining models at inference time such as
shallow fusion necessitate a shared vocabulary and tokenization, and
alternatives like fine-tuning for domain-specific performance are both time
consuming and computationally expensive. We therefore present an inference-time
ensembling algorithm aimed at "averaging" outputs from multiple LLMs and
illustrate its improved performance across multiple domains compared to its
constituent models alone. Character-wise ensemble decoding, CharED, finds the
marginal distribution of each character for an individual model and performs a
weighted average to generate an output, character by character. In coding,
math, and toxicity benchmarks, we find our proposed model able to combine
complementary strengths of multiple LLMs, regardless of vocabulary,
tokenization, or model size.
Authors' comments: 9 pages, 4 figures
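A single decoding step of the character-wise averaging described above might look like the following toy sketch. It assumes each model's next-token distribution is available as a dictionary from token string to probability, which is a simplification; the function names are hypothetical, and the bookkeeping needed to keep each model's state consistent after a character is emitted is omitted.

```python
def char_marginal(token_probs):
    """Marginal over the next character: sum token probabilities by first character."""
    marginal = {}
    for token, p in token_probs.items():
        if token:  # skip empty tokens
            marginal[token[0]] = marginal.get(token[0], 0.0) + p
    return marginal

def combine_char_distributions(p_a, p_b, weight_a=0.5):
    """Weighted average of two character distributions, renormalised."""
    chars = set(p_a) | set(p_b)
    mixed = {c: weight_a * p_a.get(c, 0.0) + (1.0 - weight_a) * p_b.get(c, 0.0)
             for c in chars}
    total = sum(mixed.values())
    return {c: p / total for c, p in mixed.items()}

# Two models with different vocabularies still share the character alphabet.
model_a = {"the": 0.6, "there": 0.2, "a": 0.2}
model_b = {"an": 0.5, "to": 0.5}
mixed = combine_char_distributions(char_marginal(model_a), char_marginal(model_b))
next_char = max(mixed, key=mixed.get)
```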
Derck W. E. Prinzhorn, Thijmen Nijdam, Putri A. van der Linden, Alexander Timans
Conformal prediction offers a practical framework for distribution-free
uncertainty quantification, providing finite-sample coverage guarantees under
relatively mild assumptions on data exchangeability. However, these assumptions
cease to hold for time series due to their temporally correlated nature. In
this work, we present a novel use of conformal prediction for time series
forecasting that incorporates time series decomposition. This approach allows
us to model different temporal components individually. By applying specific
conformal algorithms to each component and then merging the obtained prediction
intervals, we customize our methods to account for the different
exchangeability regimes underlying each component. Our decomposition-based
approach is thoroughly discussed and empirically evaluated on synthetic and
real-world data. We find that the method provides promising results on
well-structured time series, but can be limited by factors such as the
decomposition step for more complex data.
Authors' comments: Accepted at COPA 2024; 34 pages, 14 figures, 8 tables (incl.
appendix)
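As a reference point for the approach above, the sketch below shows standard split conformal prediction for one component, followed by one simple way to merge component intervals for an additive decomposition. The merge by summing bounds is an assumption made for illustration, not necessarily the paper's procedure, and it does not by itself preserve the nominal coverage level; the function names are hypothetical.

```python
import math

def split_conformal_interval(cal_residuals, prediction, alpha=0.1):
    """Symmetric interval from absolute calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest residual as the half-width.
    """
    n = len(cal_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        raise ValueError("too few calibration points for this alpha")
    q = sorted(cal_residuals)[k - 1]
    return prediction - q, prediction + q

def merge_additive(intervals):
    """Merge per-component intervals of an additive decomposition by summing bounds."""
    lower = sum(lo for lo, _ in intervals)
    upper = sum(hi for _, hi in intervals)
    return lower, upper

# Hypothetical absolute residuals |y - y_hat| for two components on a calibration set.
trend_residuals = [float(i) for i in range(1, 11)]
seasonal_residuals = [0.1 * i for i in range(1, 11)]
trend_iv = split_conformal_interval(trend_residuals, prediction=5.0, alpha=0.1)
seasonal_iv = split_conformal_interval(seasonal_residuals, prediction=0.5, alpha=0.1)
combined = merge_additive([trend_iv, seasonal_iv])
```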
Weijian Chen, Zai Yang, Zhiqiang Wei, Derrick Wing Kwan Ng, Michail Matthaiou
This paper proposes a joint active and passive beamforming design for
reconfigurable intelligent surface (RIS)-aided wireless communication systems,
adopting a piece-wise near-field channel model. While a traditional near-field
channel model, applied without any approximations, offers higher modeling
accuracy than a far-field model, it renders the system design more sensitive to
channel estimation errors (CEEs). As a remedy, we propose to adopt a piece-wise
near-field channel model that leverages the advantages of the near-field
approach while enhancing its robustness against CEEs. Our study analyzes the
impact of different channel models, including the traditional near-field, the
proposed piece-wise near-field and far-field channel models, on the
interference distribution caused by CEEs and model mismatches. Subsequently, by
treating the interference as noise, we formulate a joint active and passive
beamforming design problem to maximize the spectral efficiency (SE). The
formulated problem is then recast as a mean squared error (MSE) minimization
problem and a suboptimal algorithm is developed to iteratively update the
active and passive beamforming strategies. Simulation results demonstrate that
adopting the piece-wise near-field channel model leads to an improved SE
compared to both the near-field and far-field models in the presence of CEEs.
Furthermore, the proposed piece-wise near-field model achieves a good trade-off
between modeling accuracy and system's degrees of freedom (DoF).
Authors' comments: 28 pages
Jiachen Jiang, Jinxin Zhou, Zhihui Zhu
Analyzing the similarity of internal representations has been an important technique for understanding the behavior of deep neural networks. Most existing methods for analyzing the similarity between high-dimensional representations, such as those based on Centered Kernel Alignment (CKA), rely on statistical properties of the representations for a set of data points. In this paper, we focus on transformer models and study the similarity of representations between the hidden layers of individual transformers. In this context, we show that a simple sample-wise cosine similarity metric is capable of capturing the similarity and aligns with the more complicated CKA. Our experimental results on common transformers reveal that representations across layers are positively correlated, with similarity increasing as layers get closer. We provide a theoretical justification for this phenomenon under the geodesic curve assumption for the learned transformer. We then show that an increase in representation similarity implies an increase in predicted probability when directly applying the last-layer classifier to any hidden layer representation. We then propose an aligned training method to improve the effectiveness of shallow layers by enhancing the similarity between internal representations, with trained models that enjoy the following properties: (1) more early saturation events, (2) layer-wise accuracies that monotonically increase and reveal the minimal depth needed for the given task, (3) when serving as multi-exit models, on-par performance with standard multi-exit architectures, which consist of additional classifiers designed for early exiting in shallow layers. To our knowledge, our work is the first to show that one common classifier is sufficient for multi-exit models. We conduct experiments on both vision and NLP tasks to demonstrate the performance of the proposed aligned training.
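The sample-wise cosine similarity mentioned above needs only the two hidden representations of the same input, unlike CKA, which is computed over a set of data points. A minimal NumPy sketch follows; the function name and the (n_samples, hidden_dim) layout are assumptions.

```python
import numpy as np

def samplewise_cosine(H_a, H_b, eps=1e-12):
    """Cosine similarity between two layers' representations, one value per sample.

    H_a, H_b: (n_samples, hidden_dim) hidden states of the same inputs at two layers.
    """
    num = (H_a * H_b).sum(axis=1)
    den = np.linalg.norm(H_a, axis=1) * np.linalg.norm(H_b, axis=1)
    return num / np.maximum(den, eps)

H_early = np.array([[1.0, 0.0], [0.0, 2.0]])
H_late = np.array([[2.0, 0.0], [3.0, 0.0]])
sims = samplewise_cosine(H_early, H_late)   # one similarity per input sample
```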
Hewen Wang, Renchi Yang, Xiaokui Xiao
Graph representation learning (GRL) aims to encode graph elements into
informative vector representations, which can be used in downstream tasks for
analyzing graph-structured data and has seen extensive applications in various
domains. However, the majority of extant studies on GRL are geared towards
generating node representations, which cannot be readily employed to perform
edge-based analytics tasks in edge-attributed bipartite graphs (EABGs) that
pervade the real world, e.g., spam review detection in customer-product reviews
and identifying fraudulent transactions in user-merchant networks. Compared to
node-wise GRL, learning edge representations (ERL) on such graphs is
challenging due to the need to incorporate the structure and attribute
semantics from the perspective of edges while considering the separate
influence of two heterogeneous node sets U and V in bipartite graphs. To our
knowledge, despite its importance, limited research has been devoted to this
frontier, and existing workarounds all suffer from sub-par results.
Motivated by this, this paper designs EAGLE, an effective ERL method for
EABGs. Building on an in-depth and rigorous theoretical analysis, we propose
the factorized feature propagation (FFP) scheme for edge representations with
adequate incorporation of long-range dependencies of edges/features without
incurring tremendous computation overheads. We further ameliorate FFP as a
dual-view FFP by taking into account the influences from nodes in U and V
severally in ERL. Extensive experiments on 5 real datasets showcase the
effectiveness of the proposed EAGLE models in semi-supervised edge
classification tasks. In particular, EAGLE can attain a considerable gain of at
most 38.11% in AP and 1.86% in AUC when compared to the best baselines.
Authors' comments: 11 pages. Full version of the research paper accepted to KDD 2024
Xiaoxiong Zhang, Zhiwei Zeng, Xin Zhou, Dusit Niyato, Zhiqi Shen
Federated Knowledge Graph Embedding (FKGE) has recently garnered considerable interest due to its capacity to extract expressive representations from distributed knowledge graphs, while concurrently safeguarding the privacy of individual clients. Existing FKGE methods typically harness the arithmetic mean of entity embeddings from all clients as the global supplementary knowledge, and learn a replica of the global consensus entity embeddings for each client. However, these methods usually neglect the inherent semantic disparities among distinct clients. This oversight not only results in the globally shared complementary knowledge being inundated with too much noise when tailored to a specific client, but also instigates a discrepancy between local and global optimization objectives. Consequently, the quality of the learned embeddings is compromised. To address this, we propose Personalized Federated knowledge graph Embedding with client-wise relation Graph (PFedEG), a novel approach that employs a client-wise relation graph to learn personalized embeddings by discerning the semantic relevance of embeddings from other clients. Specifically, PFedEG learns personalized supplementary knowledge for each client by amalgamating entity embeddings from its neighboring clients based on their "affinity" on the client-wise relation graph. Each client then conducts personalized embedding learning based on its local triples and personalized supplementary knowledge. We conduct extensive experiments on four benchmark datasets to evaluate our method against state-of-the-art models, and the results demonstrate the superiority of our method.
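The contrast between the arithmetic-mean baseline and affinity-weighted aggregation described above can be sketched as follows. This is a simplified illustration, not the PFedEG algorithm: it assumes every client holds embeddings for the same entity set, takes the affinity matrix as given rather than learned, and uses hypothetical function names.

```python
import numpy as np

def mean_supplement(client_embs):
    """Baseline: the same arithmetic mean of entity embeddings for every client."""
    return client_embs.mean(axis=0)

def personalized_supplement(client_embs, affinity, client):
    """Affinity-weighted average of the other clients' entity embeddings.

    client_embs: (n_clients, n_entities, dim); affinity: (n_clients, n_clients).
    """
    w = np.array(affinity[client], dtype=float)
    w[client] = 0.0            # exclude the client's own embeddings
    w = w / w.sum()            # normalise over the neighbouring clients
    return np.tensordot(w, client_embs, axes=1)

# Three clients whose embeddings are constant 0, 1 and 2 for easy inspection.
client_embs = np.stack([np.full((2, 2), v) for v in (0.0, 1.0, 2.0)])
affinity = np.array([[1.0, 3.0, 1.0],
                     [3.0, 1.0, 1.0],
                     [1.0, 1.0, 1.0]])
global_view = mean_supplement(client_embs)
client0_view = personalized_supplement(client_embs, affinity, client=0)
```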
Tian Liu, Huixin Zhang, Shubham Parashar, Shu Kong
Few-shot recognition aims to train a classification model with only a few labeled examples of pre-defined concepts, where annotation can be costly in a downstream task. In another related research area, zero-shot recognition, which assumes no access to any downstream-task data, has been greatly advanced by using pretrained Vision-Language Models (VLMs). In this area, retrieval-augmented learning (RAL) effectively boosts zero-shot accuracy by retrieving and learning from external data relevant to downstream concepts. Motivated by these advancements, our work explores RAL for few-shot recognition. Although applying RAL to few-shot recognition seems straightforward, it has been under-explored in the literature (till now!), and we present novel challenges and opportunities that arise from it. First, perhaps surprisingly, simply finetuning the VLM on a large amount of retrieved data barely surpasses state-of-the-art zero-shot methods due to the imbalanced distribution of retrieved data and its domain gaps compared to few-shot annotated data. Second, finetuning a VLM on few-shot examples alone significantly outperforms prior methods, and finetuning on the mix of retrieved and few-shot data yields even better results. Third, to mitigate the imbalanced distribution and domain gap issues, we propose the Stage-Wise Augmented fineTuning (SWAT) method, which involves end-to-end finetuning on mixed data in the first stage and retraining the classifier solely on the few-shot data in the second stage. Extensive experiments show that SWAT achieves the best performance on standard benchmark datasets, resoundingly outperforming prior works by ~10% in accuracy. Code is available at https://github.com/tian1327/SWAT.
Qi-Jie Li, Qian Sun, Shao-Qun Zhang
Identifying gene splicing is a core and significant task confronted in modern collaboration between artificial intelligence and bioinformatics. Past decades have witnessed great efforts on this concern, such as the bio-plausible splicing pattern AT-CG and the famous SpliceAI. In this paper, we propose a novel framework for the task of gene splicing identification, named Horizon-wise Gene Splicing Identification (H-GSI). The proposed H-GSI follows the horizon-wise identification paradigm and comprises four components: the pre-processing procedure transforming string data into tensors, the sliding window technique handling long sequences, the SeqLab model, and the predictor. In contrast to existing studies that process gene information with a truncated fixed-length sequence, H-GSI employs a horizon-wise identification paradigm in which all positions in a sequence are predicted with only one forward computation, improving accuracy and efficiency. The experiments conducted on the real-world Human dataset show that our proposed H-GSI outperforms SpliceAI and achieves the best accuracy of 97.20\%. The source code is available from this link.
Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Biye Li, Huaian Chen, Yi Jin
We introduce STAR, a text-to-image model that employs a scale-wise
auto-regressive paradigm. Unlike VAR, which is constrained to class-conditioned
synthesis for images up to 256$\times$256, STAR enables text-driven image
generation up to 1024$\times$1024 through three key designs. First, we
introduce a pre-trained text encoder to extract and adopt representations for
textual constraints, enhancing details and generalizability. Second, given the
inherent structural correlation across different scales, we leverage 2D Rotary
Positional Encoding (RoPE) and tweak it into a normalized version, ensuring
consistent interpretation of relative positions across token maps and
stabilizing the training process. Third, we observe that simultaneously
sampling all tokens within a single scale can disrupt inter-token
relationships, leading to structural instability, particularly in
high-resolution generation. To address this, we propose a novel stable sampling
method that incorporates causal relationships into the sampling process,
ensuring both rich details and stable structures. Compared to previous
diffusion models and auto-regressive models, STAR surpasses existing benchmarks
in fidelity, text-image consistency, and aesthetic quality, requiring just
2.21s for 1024$\times$1024 images on A100. This highlights the potential of
auto-regressive methods in high-quality image synthesis, offering new
directions for the text-to-image generation.
Authors' comments: 16 pages
Bohan Lyu, Jianzhong Li
This paper introduces a new type of regression methodology named Convex-Area-Wise Linear Regression (CALR), which separates given datasets by disjoint convex areas and fits different linear regression models for different areas. This regression model is highly interpretable, and it is able to interpolate any given dataset, even when the underlying relationship between explanatory and response variables is non-linear and discontinuous. In order to solve the CALR problem, 3 accurate algorithms are proposed under different assumptions. The analysis of correctness and time complexity of the algorithms is given, indicating that the problem can be solved in $o(n^2)$ time accurately when the input datasets have some special features. Besides, this paper introduces an equivalent mixed integer programming problem of CALR which can be approximately solved using existing optimization solvers.