Ankit Pratap Singh, Namrata Vaswani
This letter studies the AltGDmin algorithm for solving the noisy low rank
column-wise sensing (LRCS) problem. Our sample complexity guarantee improves
upon the best existing one by a factor $\max(r, \log(1/\epsilon))/r$ where $r$
is the rank of the unknown matrix and $\epsilon$ is the final desired accuracy.
A second contribution of this work is a detailed comparison of guarantees from
all work that studies the exact same mathematical problem as LRCS, but refers
to it by different names.
Authors' comments: 8 pages
Rolf van der Hulst, Matthias Walter
Given a $\{0,1\}$-matrix $M$, the graph realization problem for $M$ asks if
there exists a spanning forest such that the columns of $M$ are incidence
vectors of paths in the forest. The problem is closely related to the
recognition of network matrices, which are a large subclass of totally
unimodular matrices and have many applications in mixed-integer programming.
Previously, Bixby and Wagner have designed an efficient algorithm for graph
realization that grows a submatrix in a column-wise fashion whilst maintaining
a graphic realization. This paper complements their work by providing an
algorithm that works in a row-wise fashion and uses similar data structures.
The main challenge in designing efficient algorithms for the graph realization
problem is ambiguity as there may exist many graphs realizing $M$. The key
insight for designing an efficient row-wise algorithm is that a graphic matrix
is uniquely represented by an SPQR tree, a graph decomposition that stores all
graphs with the same set of cycles. The developed row-wise algorithm uses data
structures that are compatible with the column-wise algorithm and can be
combined with the latter to detect maximal graphic submatrices.
Authors' comments: 40 pages, 10 figures
Sungyoon Kim, Youngjun Kim, Kihyo Moon, Minsung Jang
The advent of large language models has revolutionized natural language
processing, but their increasing complexity has led to substantial training
costs, resource demands, and environmental impacts. In response, sparse
Mixture-of-Experts (MoE) models have emerged as a promising alternative to
dense models. Since training MoE models from scratch can be prohibitively
expensive, recent studies have explored leveraging knowledge from pre-trained
non-MoE models. However, existing approaches have limitations, such as
requiring significant hardware resources and data. We propose a novel
algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model
into a MoE model with minimal additional training cost. LaDiMo consists of two
stages: layer-wise expert construction and routing policy decision. By
harnessing the concept of Knowledge Distillation, we compress the model and
rapidly recover its performance. Furthermore, we develop an adaptive router
that optimizes inference efficiency by profiling the distribution of routing
weights and determining a layer-wise policy that balances accuracy and latency.
We demonstrate the effectiveness of our method by converting the LLaMA2-7B
model to a MoE model using only 100K tokens, reducing activated parameters by
over 20% while keeping accuracy. Our approach offers a flexible and efficient
solution for building and deploying MoE models.
Authors' comments: 21 pages, 10 figures
Yusuf Sale, Paul Hofman, Timo Löhr, Lisa Wimmer, Thomas Nagler, Eyke Hüllermeier
We present a novel approach to uncertainty quantification in classification
tasks based on label-wise decomposition of uncertainty measures. This
label-wise perspective allows uncertainty to be quantified at the individual
class level, thereby improving cost-sensitive decision-making and helping
understand the sources of uncertainty. Furthermore, it allows to define total,
aleatoric, and epistemic uncertainty on the basis of non-categorical measures
such as variance, going beyond common entropy-based measures. In particular,
variance-based measures address some of the limitations associated with
established methods that have recently been discussed in the literature. We
show that our proposed measures adhere to a number of desirable properties.
Through empirical evaluation on a variety of benchmark data sets -- including
applications in the medical domain where accurate uncertainty quantification is
crucial -- we establish the effectiveness of label-wise uncertainty
quantification.
Authors' comments: Uncertainty in Artificial Intelligence. arXiv admin note: substantial
text overlap with arXiv:2401.00276
Ana Brassard, Benjamin Heinzerling, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui
Evaluating the quality of free-text explanations is a multifaceted,
subjective, and labor-intensive task. Large language models (LLMs) present an
appealing alternative due to their potential for consistency, scalability, and
cost-efficiency. In this work, we present ACORN, a new dataset of 3,500
free-text explanations and aspect-wise quality ratings, and use it to evaluate
how LLMs rate explanations. We observed that larger models outputted labels
that maintained or increased the inter-annotator agreement, suggesting that
they are within the expected variance between human raters. However, their
correlation with majority-voted human ratings varied across different quality
aspects, indicating that they are not a complete replacement. In turn, using
LLMs as a supplement to a smaller group of human raters in some cases improved
the correlation with the original majority labels. However, the effect was
limited to cases where human raters were scarce, and an additional human rater
had a more pronounced effect in all cases. Overall, we recommend against using
LLMs as a complete replacement for human raters but encourage using them in
configurations that end with targeted human involvement. Data available here:
https://github.com/a-brassard/ACORN
Authors' comments: 18 pages, 7 figures, accepted to COLM 2024. Data available here:
https://github.com/a-brassard/ACORN
Peter Frankl, Andrey Kupavskii
A family of subsets of $[n]$ is $r$-wise agreeing if for any $r$ sets from the family there is an element $x$ that is either contained in all or contained in none of the $r$ sets. The study of such families is motivated by questions in discrete optimization. In this paper, we determine the size of the largest non-trivial $r$-wise agreeing family. This can be seen as a generalization of the classical Brace-Daykin theorem.
Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett
We study contextual bandits in the presence of a stage-wise constraint (a
constraint at each round), when the constraint must be satisfied both with high
probability and in expectation. Obviously the setting where the constraint is
in expectation is a relaxation of the one with high probability. We start with
the linear case where both the contextual bandit problem (reward function) and
the stage-wise constraint (cost function) are linear. In each of the high
probability and in expectation settings, we propose an upper-confidence bound
algorithm for the problem and prove a $T$-round regret bound for it. Our
algorithms balance exploration and constraint satisfaction using a novel idea
that scales the radii of the reward and cost confidence sets with different
scaling factors. We also prove a lower-bound for this constrained problem, show
how our algorithms and analyses can be extended to multiple constraints, and
provide simulations to validate our theoretical results. In the high
probability setting, we describe the minimum requirements for the action set in
order for our algorithm to be tractable. In the setting that the constraint is
in expectation, we further specialize our results to multi-armed bandits and
propose a computationally efficient algorithm for this setting with regret
analysis. Finally, we extend our results to the case where the reward and cost
functions are both non-linear. We propose an algorithm for this case and prove
a regret bound for it that characterize the function class complexity by the
eluder dimension.
Authors' comments: 53 pages. arXiv admin note: text overlap with arXiv:2006.10185
C. Contreras Peña, M. Ashraf, J. E. Lee, G. Herczeg, P. W. Lucas, Z. Guo, D. Johnstone, H. G. Lee et al.
This work presents four high-amplitude variable YSOs ($\simeq$ 3 mag at near-
or mid-IR wavelengths) arising from the SPICY catalog. Three outbursts show a
duration that is longer than 1 year, and are still ongoing. And additional YSO
brightened over the last two epochs of NEOWISE observations and the duration of
the outburst is thus unclear. Analysis of the spectra of the four sources
confirms them as new members of the eruptive variable class. We find two YSOs
that can be firmly classified as bona fide FUors and one object that falls in
the V1647 Ori-like class. Given the uncertainty in the duration of its
outburst, an additional YSO can only be classified as a candidate FUor.
Continued monitoring and follow-up of these particular sources is important to
better understand the accretion process of YSOs.
Authors' comments: 10 pages, 10 figures, Accepted for publication at the Journal of the
Korean Astronomical Society
Adam Dejl, Hamed Ayoobi, Matthew Williams, Francesca Toni
Feature attribution methods are widely used to explain neural models by determining the influence of individual input features on the models' outputs. We propose a novel feature attribution method, CAFE (Conflict-Aware Feature-wise Explanations), that addresses three limitations of the existing methods: their disregard for the impact of conflicting features, their lack of consideration for the influence of bias terms, and an overly high sensitivity to local variations in the underpinning activation functions. Unlike other methods, CAFE provides safeguards against overestimating the effects of neuron inputs and separately traces positive and negative influences of input features and biases, resulting in enhanced robustness and increased ability to surface feature conflicts. We show experimentally that CAFE is better able to identify conflicting features on synthetic tabular data and exhibits the best overall fidelity on several real-world tabular datasets, while being highly computationally efficient.
Tycho F. A. van der Ouderaa, Alexander Immer, Mark van der Wilk
Convolutions encode equivariance symmetries into neural networks leading to better generalisation performance. However, symmetries provide fixed hard constraints on the functions a network can represent, need to be specified in advance, and can not be adapted. Our goal is to allow flexible symmetry constraints that can automatically be learned from data using gradients. Learning symmetry and associated weight connectivity structures from scratch is difficult for two reasons. First, it requires efficient and flexible parameterisations of layer-wise equivariances. Secondly, symmetries act as constraints and are therefore not encouraged by training losses measuring data fit. To overcome these challenges, we improve parameterisations of soft equivariance and learn the amount of equivariance in layers by optimising the marginal likelihood, estimated using differentiable Laplace approximations. The objective balances data fit and model complexity enabling layer-wise symmetry discovery in deep networks. We demonstrate the ability to automatically learn layer-wise equivariances on image classification tasks, achieving equivalent or improved performance over baselines with hard-coded symmetry.
Yafang Zheng, Lei Lin, Zhaohong Lai, Binling Wang, Shan Liu, Biao Fu, Wenhao Rao, Peigen Ye et al.
Despite successes across a broad range of applications, sequence-to-sequence
models' construct of solutions are argued to be less compositional than
human-like generalization. There is mounting evidence that one of the reasons
hindering compositional generalization is representations of the encoder and
decoder uppermost layer are entangled. In other words, the syntactic and
semantic representations of sequences are twisted inappropriately. However,
most previous studies mainly concentrate on enhancing token-level semantic
information to alleviate the representations entanglement problem, rather than
composing and using the syntactic and semantic representations of sequences
appropriately as humans do. In addition, we explain why the entanglement
problem exists from the perspective of recent studies about training deeper
Transformer, mainly owing to the ``shallow'' residual connections and its
simple, one-step operations, which fails to fuse previous layers' information
effectively. Starting from this finding and inspired by humans' strategies, we
propose \textsc{FuSion} (\textbf{Fu}sing \textbf{S}yntactic and
Semant\textbf{i}c Representati\textbf{on}s), an extension to
sequence-to-sequence models to learn to fuse previous layers' information back
into the encoding and decoding process appropriately through introducing a
\emph{fuse-attention module} at each encoder and decoder layer. \textsc{FuSion}
achieves competitive and even \textbf{state-of-the-art} results on two
realistic benchmarks, which empirically demonstrates the effectiveness of our
proposal.
Authors' comments: work in progress. arXiv admin note: substantial text overlap with
arXiv:2305.12169
Roshan Sharma, Kenneth Zheng, Siddhant Arora, Shinji Watanabe, Rita Singh, Bhiksha Raj
End-to-end speech summarization has been shown to improve performance over
cascade baselines. However, such models are difficult to train on very large
inputs (dozens of minutes or hours) owing to compute restrictions and are hence
trained with truncated model inputs. Truncation leads to poorer models, and a
solution to this problem rests in block-wise modeling, i.e., processing a
portion of the input frames at a time. In this paper, we develop a method that
allows one to train summarization models on very long sequences in an
incremental manner. Speech summarization is realized as a streaming process,
where hypothesis summaries are updated every block based on new acoustic
information. We devise and test strategies to pass semantic context across the
blocks. Experiments on the How2 dataset demonstrate that the proposed
block-wise training method improves by 3 points absolute on ROUGE-L over a
truncated input baseline.
Authors' comments: Accepted at Interspeech 2023
Keke Huang, Jing Tang, Juncheng Liu, Renchi Yang, Xiaokui Xiao
Graph Neural Networks (GNNs) have shown superior performance for semi-supervised learning of numerous web applications, such as classification on web services and pages, analysis of online social networks, and recommendation in e-commerce. The state of the art derives representations for all nodes in graphs following the same diffusion (message passing) model without discriminating their uniqueness. However, (i) labeled nodes involved in model training usually account for a small portion of graphs in the semisupervised setting, and (ii) different nodes locate at different graph local contexts and it inevitably degrades the representation qualities if treating them undistinguishedly in diffusion. To address the above issues, we develop NDM, a universal node-wise diffusion model, to capture the unique characteristics of each node in diffusion, by which NDM is able to yield high-quality node representations. In what follows, we customize NDM for semisupervised learning and design the NIGCN model. In particular, NIGCN advances the efficiency significantly since it (i) produces representations for labeled nodes only and (ii) adopts well-designed neighbor sampling techniques tailored for node representation generation. Extensive experimental results on various types of web datasets, including citation, social and co-purchasing graphs, not only verify the state-of-the-art effectiveness of NIGCN but also strongly support the remarkable scalability of NIGCN. In particular, NIGCN completes representation generation and training within 10 seconds on the dataset with hundreds of millions of nodes and billions of edges, up to orders of magnitude speedups over the baselines, while achieving the highest F1-scores on classification.
Rodrigo Arruda, Bernardo Carvalho, Alberto Sarmiento
This paper discusses the dynamics of continuum-wise hyperbolic surface homeomorphisms. We prove that $cw_F$-hyperbolic surface homeomorphisms containing only a finite set of spines are $cw_2$-hyperbolic. In the case of $cw_3$-hyperbolic homeomorphisms we prove the finiteness of spines and, hence, that $cw_3$-hyperbolicity implies $cw_2$-hyperbolicity. In the proof, we adapt techniques of Hiraide [11] and Lewowicz [15] in the case of expansive surface homeomorphisms to prove that local stable/unstable continua of $cw_F$-hyperbolic homeomorphisms are continuous arcs. We also adapt techniques of Artigue, Pac\'ifico and Vieitez [6] in the case of N-expansive surface homeomorphisms to prove that the existence of spines is strongly related to the existence of bi-asymptotic sectors and conclude that spines are necessarily isolated from other spines.
Yuto Watanabe, Kazunori Sakurama
This study addresses a distributed optimization with a novel class of coupling of variables, called clique-wise coupling. A clique is a node set of a complete subgraph of an undirected graph. This setup is an extension of pairwise coupled optimization problems (e.g., consensus optimization) and allows us to handle coupling of variables consisting of more than two agents systematically. To solve this problem, we propose a clique-based linearized ADMM algorithm, which is proved to be distributed. Additionally, we consider objective functions given as a sum of nonsmooth and smooth convex functions and present a more flexible algorithm based on the FLiP-ADMM algorithm. Moreover, we provide convergence theorems of these algorithms. Notably, all the algorithmic parameters and the derived condition in the theorems depend only on local information, which means that each agent can choose the parameters in a distributed manner. Finally, we apply the proposed methods to a consensus optimization problem and demonstrate their effectiveness via numerical experiments.
Sri Charan Kattamuru, Kshitij Agrawal, Shyam Prasad Adhikari, Abhishek Bose, Hemant Misra
Images captured through smartphone cameras often suffer from degradation,
blur being one of the major ones, posing a challenge in processing these images
for downstream tasks. In this paper we propose low-compute lightweight
patch-wise features for image quality assessment. Using our method we can
discriminate between blur vs sharp image degradation. To this end, we train a
decision-tree based XGBoost model on various intuitive image features like gray
level variance, first and second order gradients, texture features like local
binary patterns. Experiments conducted on an open dataset show that the
proposed low compute method results in 90.1% mean accuracy on the validation
set, which is comparable to the accuracy of a compute-intensive VGG16 network
with 94% mean accuracy fine-tuned to this task. To demonstrate the
generalizability of our proposed features and model we test the model on BHBID
dataset and an internal dataset where we attain accuracy of 98% and 91%,
respectively. The proposed method is 10x faster than the VGG16 based model on
CPU and scales linearly to the input image size making it suitable to be
implemented on low compute edge devices.
Authors' comments: Accepted in AIMLSystems2022
Zeming Wei, Yifei Wang, Yiwen Guo, Yisen Wang
Adversarial training has been widely acknowledged as the most effective
method to improve the adversarial robustness against adversarial examples for
Deep Neural Networks (DNNs). So far, most existing works focus on enhancing the
overall model robustness, treating each class equally in both the training and
testing phases. Although revealing the disparity in robustness among classes,
few works try to make adversarial training fair at the class level without
sacrificing overall robustness. In this paper, we are the first to
theoretically and empirically investigate the preference of different classes
for adversarial configurations, including perturbation margin, regularization,
and weight averaging. Motivated by this, we further propose a
\textbf{C}lass-wise calibrated \textbf{F}air \textbf{A}dversarial training
framework, named CFA, which customizes specific training configurations for
each class automatically. Experiments on benchmark datasets demonstrate that
our proposed CFA can improve both overall robustness and fairness notably over
other state-of-the-art methods. Code is available at
\url{https://github.com/PKU-ML/CFA}.
Authors' comments: CVPR 2023
Kenyu Kobayashi, Renata Khasanova, Arno Schneuwly, Felix Schmidt, Matteo Casserini
Autoencoders are a powerful and versatile tool often used for various problems such as anomaly detection, image processing and machine translation. However, their reconstructions are not always trivial to explain. Therefore, we propose a fast explainability solution by extending the Layer-wise Relevance Propagation method with the help of Deep Taylor Decomposition framework. Furthermore, we introduce a novel validation technique for comparing our explainability approach with baseline methods in the case of missing ground-truth data. Our results highlight computational as well as qualitative advantages of the proposed explainability solution with respect to existing methods.
Kai Zhang, Yutong Dai, Hongyi Wang, Eric Xing, Xun Chen, Lichao Sun
Federated learning is a promising paradigm that allows multiple clients to collaboratively train a model without sharing the local data. However, the presence of heterogeneous devices in federated learning, such as mobile phones and IoT devices with varying memory capabilities, would limit the scale and hence the performance of the model could be trained. The mainstream approaches to address memory limitations focus on width-slimming techniques, where different clients train subnetworks with reduced widths locally and then the server aggregates the subnetworks. The global model produced from these methods suffers from performance degradation due to the negative impact of the actions taken to handle the varying subnetwork widths in the aggregation phase. In this paper, we introduce a memory-adaptive depth-wise learning solution in FL called FeDepth, which adaptively decomposes the full model into blocks according to the memory budgets of each client and trains blocks sequentially to obtain a full inference model. Our method outperforms state-of-the-art approaches, achieving 5% and more than 10% improvements in top-1 accuracy on CIFAR-10 and CIFAR-100, respectively. We also demonstrate the effectiveness of depth-wise fine-tuning on ViT. Our findings highlight the importance of memory-aware techniques for federated learning with heterogeneous devices and the success of depth-wise training strategy in improving the global model's performance.
Yucheng Xu, Mohammadreza Kasaei, Hamidreza Kasaei, Zhibin Li
Generating high-quality instance-wise grasp configurations provides critical information of how to grasp specific objects in a multi-object environment and is of high importance for robot manipulation tasks. This work proposed a novel \textbf{S}ingle-\textbf{S}tage \textbf{G}rasp (SSG) synthesis network, which performs high-quality instance-wise grasp synthesis in a single stage: instance mask and grasp configurations are generated for each object simultaneously. Our method outperforms state-of-the-art on robotic grasp prediction based on the OCID-Grasp dataset, and performs competitively on the JACQUARD dataset. The benchmarking results showed significant improvements compared to the baseline on the accuracy of generated grasp configurations. The performance of the proposed method has been validated through both extensive simulations and real robot experiments for three tasks including single object pick-and-place, grasp synthesis in cluttered environments and table cleaning task.