Yanan Wu, Jie Liu, Xingyuan Bu, Jiaheng Liu, Zhanhui Zhou, Yuanxing Zhang, Chenchen Zhang, Zhiqi Bai et al.
This paper introduces ConceptMath, a bilingual (English and Chinese),
fine-grained benchmark that evaluates concept-wise mathematical reasoning of
Large Language Models (LLMs). Unlike traditional benchmarks that evaluate
general mathematical reasoning with an average accuracy, ConceptMath
systematically organizes math problems under a hierarchy of math concepts, so
that mathematical reasoning can be evaluated at different granularity with
concept-wise accuracies. Based on our ConceptMath, we evaluate a broad range
of LLMs, and we observe existing LLMs, though achieving high average accuracies
on traditional benchmarks, exhibit significant performance variations across
different math concepts and may even fail catastrophically on the most basic
ones. We also introduce an efficient fine-tuning strategy to address
the weaknesses of existing LLMs. Finally, we hope ConceptMath can guide
developers to understand the fine-grained mathematical abilities of their
models and facilitate the growth of foundation models.
Authors' comments: The benchmark dataset will be released soon
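As an illustration of concept-wise evaluation, the sketch below aggregates accuracy at every level of a concept hierarchy. It is a minimal sketch, not the benchmark's released tooling: the record format, the function name `concept_accuracies`, and the toy concept names are assumptions made for this example.

```python
from collections import defaultdict


def concept_accuracies(records):
    """Aggregate accuracy at every level of a concept hierarchy.

    `records` is an iterable of (concept_path, is_correct) pairs, where
    concept_path is a tuple such as ("Arithmetic", "Fractions").
    This record format is hypothetical.
    """
    totals = defaultdict(lambda: [0, 0])  # path prefix -> [num_correct, num_total]
    for path, is_correct in records:
        # Credit the problem to each ancestor concept, including the root ().
        for depth in range(len(path) + 1):
            bucket = totals[path[:depth]]
            bucket[0] += int(is_correct)
            bucket[1] += 1
    return {prefix: correct / total for prefix, (correct, total) in totals.items()}


# Toy results (made up for illustration only).
records = [
    (("Arithmetic", "Fractions"), True),
    (("Arithmetic", "Fractions"), False),
    (("Arithmetic", "Decimals"), True),
    (("Geometry", "Angles"), True),
]
acc = concept_accuracies(records)
for prefix in sorted(acc):
    print("/".join(prefix) or "<all>", round(acc[prefix], 3))
```

In this toy data the overall accuracy is 0.75 while the "Fractions" leaf sits at 0.5, which is the kind of gap that a single averaged score hides.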
Chen Shenglun, Zhang Hong, Ma XinZhu, Wang Zhihui, Li Haojie
Depth completion is a long-standing challenge in computer vision, where
classification-based methods have made tremendous progress in recent years.
However, most existing classification-based methods rely on pre-defined
pixel-shared and discrete depth values as depth categories. This representation
fails to capture the continuous depth values that conform to the real depth
distribution, leading to depth smearing in boundary regions. To address this
issue, we revisit depth completion from the clustering perspective and propose
a novel clustering-based framework called CluDe which focuses on learning the
pixel-wise and continuous depth representation. The key idea of CluDe is to
iteratively update the pixel-shared and discrete depth representation to its
corresponding pixel-wise and continuous counterpart, driven by the real depth
distribution. Specifically, CluDe first utilizes depth value clustering to
learn a set of depth centers as the depth representation. While these depth
centers are pixel-shared and discrete, they are more in line with the real
depth distribution compared to pre-defined depth categories. Then, CluDe
estimates offsets for these depth centers, enabling their dynamic adjustment
along the depth axis of the depth distribution to generate the pixel-wise and
continuous depth representation. Extensive experiments demonstrate that CluDe
successfully reduces depth smearing around object boundaries by utilizing
pixel-wise and continuous depth representation. Furthermore, CluDe achieves
state-of-the-art performance on the VOID datasets and outperforms
classification-based methods on the KITTI dataset.
Authors' comments: Published in IEEE TCSVT, 15 pages, 12 figures
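The two stages described in this abstract (clustering depth values into shared centers, then shifting those centers per pixel) can be sketched roughly as follows. This is only a sketch under stated assumptions: plain 1D k-means stands in for the clustering step, and the final depth is taken as a probability-weighted sum of shifted centers, which the abstract does not specify. The names `kmeans_1d` and `pixel_depth` are hypothetical.

```python
def kmeans_1d(values, k, iters=20):
    """Plain 1D k-means (Lloyd's algorithm); returns the centers in sorted order."""
    lo, hi = min(values), max(values)
    # Spread the initial centers evenly over the observed depth range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center when a cluster receives no samples.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


def pixel_depth(centers, offsets, probs):
    """Depth for one pixel: probability-weighted sum of shifted centers (assumed form)."""
    return sum(p * (c + o) for c, o, p in zip(centers, offsets, probs))


# Toy depth samples in metres (made up): a near cluster and a far cluster.
depths = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
centers = kmeans_1d(depths, k=2)

# Per-pixel offsets and probabilities would come from a network; these are placeholders.
depth = pixel_depth(centers, offsets=[0.05, -0.1], probs=[0.8, 0.2])
print(centers, depth)
```

The centers are shared by all pixels and discrete, while the offsets make the representation pixel-wise and continuous, matching the distinction the abstract draws.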
Xiaosa Li, Runze Zhao, Chengyue Lu, Xiao Xiao, Wenbo Ding
Surface vibration tactile feedback can convey various kinds of semantic information to humans via handheld electronic devices such as smartphones, touch panels, and game controllers. However, covering the whole contact surface of a device with a dense actuator arrangement can interfere with its normal use, so producing desired vibration patterns at any contact point with only a few sparse actuators deployed on the handheld device surface remains a significant challenge. In this work, we develop a smartphone-sized tactile feedback board with only five actuators and achieve precise vibration pattern production that can be focused at any desired position on the board. Specifically, we investigate the vibration characteristics of a single passive coil actuator and construct its vibration pattern model at any position on the feedback board surface. Optimal phase and amplitude modulation, found with the simulated annealing algorithm, is applied to the five actuators in a sparse array, and all actuators' vibration patterns are superimposed linearly to synthesize different onboard vibration energy distributions for tactile sensing. Experiments demonstrate that point-wise vibration pattern production on our tactile board achieves an average of about 0.9 on the Structural Similarity Index Measure (SSIM) when compared to the ideal single-point-focused target vibration pattern. The sparse actuator array can be easily embedded into common handheld electronic devices, which has significant implications for enriching their haptic interaction functionalities.
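The combination of linear superposition and simulated annealing can be sketched as below. Everything specific here is an assumption for illustration: each actuator's response at a surface point is modelled as one complex number, the objective is the fraction of total energy at the target point (the paper evaluates with SSIM instead), and the cooling schedule and step sizes are arbitrary.

```python
import cmath
import math
import random


def energy_map(responses, amps, phases):
    """Superimpose the actuator fields linearly; return |field|^2 at each point."""
    n_points = len(responses[0])
    energies = []
    for p in range(n_points):
        field = sum(a * cmath.exp(1j * ph) * resp[p]
                    for a, ph, resp in zip(amps, phases, responses))
        energies.append(abs(field) ** 2)
    return energies


def focus_ratio(responses, amps, phases, target):
    """Fraction of the total vibration energy that lands on the target point."""
    energies = energy_map(responses, amps, phases)
    total = sum(energies)
    return energies[target] / total if total > 0 else 0.0


def anneal(responses, target, steps=2000, t0=0.1, seed=0):
    """Simulated annealing over per-actuator amplitude and phase."""
    rng = random.Random(seed)
    n = len(responses)
    amps, phases = [1.0] * n, [0.0] * n
    cur = focus_ratio(responses, amps, phases, target)
    best = (cur, amps[:], phases[:])
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling (assumed)
        i = rng.randrange(n)
        new_amps, new_phases = amps[:], phases[:]
        new_amps[i] = min(1.0, max(0.0, amps[i] + rng.gauss(0, 0.1)))
        new_phases[i] = (phases[i] + rng.gauss(0, 0.3)) % (2 * math.pi)
        cand = focus_ratio(responses, new_amps, new_phases, target)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cand >= cur or rng.random() < math.exp((cand - cur) / temp):
            amps, phases, cur = new_amps, new_phases, cand
            if cur > best[0]:
                best = (cur, amps[:], phases[:])
    return best


# Hypothetical responses: 5 actuators observed at 6 surface points.
rng = random.Random(1)
responses = [[complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
             for _ in range(5)]
initial = focus_ratio(responses, [1.0] * 5, [0.0] * 5, target=2)
best_ratio, best_amps, best_phases = anneal(responses, target=2)
print(initial, best_ratio)
```

Because the best state seen so far is tracked separately from the current state, the returned focus ratio is never worse than the all-in-phase starting point.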
AprilPyone MaungMaung, Huy H. Nguyen, Hitoshi Kiya, Isao Echizen
We propose a method for generating spurious features by leveraging large-scale text-to-image diffusion models. Although previous work detects spurious features in a large-scale dataset like ImageNet and introduces Spurious ImageNet, we find that not all spurious images are spurious across different classifiers. Although spurious images help measure a classifier's reliance on spurious features, filtering many images from the Internet to find more spurious features is time-consuming. To this end, we utilize an existing approach of personalizing large-scale text-to-image diffusion models with available discovered spurious images and propose a new spurious feature similarity loss based on neural features of an adversarially robust model. Precisely, we fine-tune Stable Diffusion with several reference images from Spurious ImageNet with a modified objective incorporating the proposed spurious-feature similarity loss. Experimental results show that our method can generate spurious images that are consistently spurious across different classifiers. Moreover, the generated spurious images are visually similar to reference images from Spurious ImageNet.
Yuji Roh, Qingyun Liu, Huan Gui, Zhe Yuan, Yujin Tang, Steven Euijong Whang, Liang Liu, Shuchao Bi et al.
Fine-tuning is becoming widely used for leveraging the power of pre-trained foundation models in new downstream tasks. While there are many successes of fine-tuning on various tasks, recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions (i.e., out-of-distribution; OOD). To improve OOD generalization, some previous studies identify the limitations of fine-tuning data and regulate fine-tuning to preserve the general representation learned from pre-training data. However, potential limitations in the pre-training data and models are often ignored. In this paper, we contend that overly relying on the pre-trained representation may hinder fine-tuning from learning essential representations for downstream tasks and thus hurt its OOD generalization. It can be especially catastrophic when new tasks are from different (sub)domains compared to pre-training data. To address the issues in both pre-training and fine-tuning data, we propose a novel generalizable fine-tuning method LEVI, where the pre-trained model is adaptively ensembled layer-wise with a small task-specific model, while preserving training and inference efficiencies. By combining two complementing models, LEVI effectively suppresses problematic features in both the fine-tuning data and pre-trained model and preserves useful features for new tasks. Broad experiments with large language and vision models show that LEVI greatly improves fine-tuning generalization via emphasizing different views from fine-tuning data and pre-trained features.
Umut Cem Entok, Firas Laakom, Farhad Pakdaman, Moncef Gabbouj
Most scenes are illuminated by several light sources, where the traditional
assumption of uniform illumination is invalid. This issue is ignored in most
color constancy methods, primarily due to the complex spatial impact of
multiple light sources on the image. Moreover, most existing multi-illuminant
methods fail to preserve the smooth change of illumination, which stems from
spatial dependencies in natural images. Motivated by this, we propose a novel
multi-illuminant color constancy method, by learning pixel-wise illumination
maps caused by multiple light sources. The proposed method enforces smoothness
within neighboring pixels, by regularizing the training with the total
variation loss. Moreover, a bilateral filter is further applied to enhance
the natural appearance of the estimated images, while preserving the edges.
Additionally, we propose a label-smoothing technique that enables the model to
generalize well despite the uncertainties in ground truth. Quantitative and
qualitative experiments demonstrate that the proposed method outperforms the
state-of-the-art.
Authors' comments: Copyright 2024 IEEE - Submitted to IEEE ICIP 2024
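To make the smoothness regularizer concrete, the sketch below computes an anisotropic (L1) total variation of a 2D map. The abstract only names "total variation loss", so the exact variant used here, the function name `total_variation`, and the single-channel list-of-rows input are assumptions.

```python
def total_variation(img):
    """Anisotropic total variation of a 2D map given as a list of rows.

    Sums absolute differences between horizontally and vertically
    adjacent pixels; smooth maps score low, rapidly changing maps score high.
    """
    h, w = len(img), len(img[0])
    tv = 0.0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                tv += abs(img[y][x + 1] - img[y][x])
            if y + 1 < h:
                tv += abs(img[y + 1][x] - img[y][x])
    return tv


# Toy illumination maps (made up): one uniform, one checkerboard.
smooth = [[1.0, 1.0], [1.0, 1.0]]
checker = [[0.0, 1.0], [1.0, 0.0]]
print(total_variation(smooth), total_variation(checker))
```

Adding such a term to the training objective penalizes abrupt changes between neighboring pixels of the predicted illumination map, which is the role the abstract assigns to it.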
Alexandra Saliba, Yuanchao Li, Ramon Sanabria, Catherine Lai
The efficacy of self-supervised speech models has been validated, yet the
optimal utilization of their representations remains challenging across diverse
tasks. In this study, we delve into Acoustic Word Embeddings (AWEs), a
fixed-length feature derived from continuous representations, to explore their
advantages in specific tasks. AWEs have previously shown utility in capturing
acoustic discriminability. In light of this, we propose measuring layer-wise
similarity between AWEs and word embeddings, aiming to further investigate the
inherent context within AWEs. Moreover, we evaluate the contribution of AWEs,
in comparison to other types of speech features, in the context of Speech
Emotion Recognition (SER). Through a comparative experiment and a layer-wise
accuracy analysis on two distinct corpora, IEMOCAP and ESD, we explore
differences between AWEs and raw self-supervised representations, as well as
the proper utilization of AWEs alone and in combination with word embeddings.
Our findings underscore the acoustic context conveyed by AWEs and showcase the
highly competitive SER accuracies by appropriately employing AWEs.
Authors' comments: Accepted to ICASSP2024 Self-supervision in Audio, Speech and Beyond
(SASB) workshop. First two authors contributed equally
Snir Ben Ovadia
We introduce the notion of tubular dimension, and give a formula for it. As an application we show that every invariant measure of a $C^{1+\gamma}$ diffeomorphism of a closed Riemannian manifold admits an asymptotic local product structure for conditional measures on intermediate foliations of unstable leaves. As a second application, we prove a bound on the gap between any two consecutive conditional entropies, in the form of volume growth. As a third application, for certain $C^\infty$ maps we compute all conditional entropies for the measure of maximal entropy; in particular, as a consequence, in a follow-up paper we compute the Hausdorff dimension of the equilibrium measure of holomorphic endomorphisms of $\mathbb{C}\mathbb{P}^k$, $k\geq 1$, giving a solution to the Binder-DeMarco conjecture, and answering a question of Forn{\ae}ss and Sibony.
Danning Lao, Qi Liu, Jiazi Bu, Junchi Yan, Wei Shen
As computer vision continues to advance and finds widespread applications across various domains, the need for interpretability in deep learning models becomes paramount. Existing methods often resort to post-hoc techniques or prototypes to explain the decision-making process, which can be indirect and lack intrinsic illustration. In this research, we introduce ViTree, a novel approach for fine-grained visual categorization that combines the popular vision transformer as a feature extraction backbone with neural decision trees. By traversing the tree paths, ViTree effectively selects patches from transformer-processed features to highlight informative local regions, thereby refining representations in a step-wise manner. Unlike previous tree-based models that rely on soft distributions or ensembles of paths, ViTree selects a single tree path, offering a clearer and simpler decision-making process. This patch and path selectivity enhances model interpretability of ViTree, enabling better insights into the model's inner workings. Remarkably, extensive experimentation validates that this streamlined approach surpasses various strong competitors and achieves state-of-the-art performance while maintaining exceptional interpretability which is proved by multi-perspective methods. Code can be found at https://github.com/SJTU-DeepVisionLab/ViTree.
Chak Fong Chong, Xinyi Fang, Jielong Guo, Yapeng Wang, Wei Ke, Chan-Tong Lam, Sio-Kei Im
Large-scale image datasets are often partially labeled, where only a few categories' labels are known for each image. Assigning pseudo-labels to unknown labels to gain additional training signals has become prevalent for training deep classification models. However, some pseudo-labels are inevitably incorrect, leading to a notable decline in the model classification performance. In this paper, we propose a novel method called Category-wise Fine-Tuning (CFT), aiming to reduce model inaccuracies caused by the wrong pseudo-labels. In particular, CFT employs known labels without pseudo-labels to fine-tune the logistic regressions of trained models individually to calibrate each category's model predictions. Genetic Algorithm, seldom used for training deep models, is also utilized in CFT to maximize the classification performance directly. CFT is applied to well-trained models, unlike most existing methods that train models from scratch. Hence, CFT is general and compatible with models trained with different methods and schemes, as demonstrated through extensive experiments. CFT requires only a few seconds for each category for calibration with consumer-grade GPUs. We achieve state-of-the-art results on three benchmarking datasets, including the CheXpert chest X-ray competition dataset (ensemble mAUC 93.33%, single model 91.82%), partially labeled MS-COCO (average mAP 83.69%), and Open Image V3 (mAP 85.31%), outperforming the previous bests by 0.28%, 2.21%, 2.50%, and 0.91%, respectively. The single model on CheXpert has been officially evaluated by the competition server, endorsing the correctness of the result. The outstanding results and generalizability indicate that CFT could be substantial and prevalent for classification model development. Code is available at: https://github.com/maxium0526/category-wise-fine-tuning.
Erik Duse
In this work we provide a survey of Fuglede's flux extensions of first order
partial differential operators, a concept largely forgotten today. Along the
way we also survey the classical weak and strong extensions of PDE operators
and the works of Friedrichs and H\"ormander. We give several applications of
this theory showing its usefulness, as well as connecting it to more recent
developments in connection to various sharp versions of the divergence theorem.
In particular, we use it to prove a generalization of Morera's theorem valid
for general first order operators. Using this theory we also prove a new local
limit formula for the maximal extension of a first order operator. We initiate
a study of this limit and connect it to the wave cone of the operator, a
concept that first arose in the theory of compensated compactness. Hopefully,
this will contribute to a revival of Fuglede's beautiful ideas.
Authors' comments: Feedback is welcome! Typos fixed
Nachuan Ma, Rui Fan, Lihua Xie
Over the past decade, automated methods have been developed to detect cracks more efficiently, accurately, and objectively, with the ultimate goal of replacing conventional manual visual inspection techniques. Among these methods, semantic segmentation algorithms have demonstrated promising results in pixel-wise crack detection tasks. However, training such data-driven algorithms requires a large amount of human-annotated datasets with pixel-level annotations, which is a highly labor-intensive and time-consuming process. Moreover, supervised learning-based methods often struggle with poor generalization ability in unseen datasets. Therefore, we propose an unsupervised pixel-wise road crack detection network, known as UP-CrackNet. Our approach first generates multi-scale square masks and randomly selects them to corrupt undamaged road images by removing certain regions. Subsequently, a generative adversarial network is trained to restore the corrupted regions by leveraging the semantic context learned from surrounding uncorrupted regions. During the testing phase, an error map is generated by calculating the difference between the input and restored images, which allows for pixel-wise crack detection. Our comprehensive experimental results demonstrate that UP-CrackNet outperforms other general-purpose unsupervised anomaly detection algorithms, and exhibits comparable performance and superior generalizability when compared with state-of-the-art supervised crack segmentation algorithms. Our source code is publicly available at mias.group/UP-CrackNet.
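The corrupt-then-restore idea and the test-time error map can be sketched as follows. This is a minimal sketch under assumptions: images are single-channel lists of rows, the trained generator is replaced by a hand-written "restored" image, and the fixed threshold used to turn the error map into a crack mask is not taken from the abstract. The names `corrupt`, `error_map`, and `detect` are hypothetical.

```python
def corrupt(image, masks):
    """Zero out square regions; each mask is a (top, left, size) triple."""
    out = [row[:] for row in image]  # copy so the clean image is left untouched
    for top, left, size in masks:
        for y in range(top, min(top + size, len(out))):
            for x in range(left, min(left + size, len(out[0]))):
                out[y][x] = 0.0
    return out


def error_map(inp, restored):
    """Pixel-wise absolute difference between the input and restored images."""
    return [[abs(a - b) for a, b in zip(r1, r2)] for r1, r2 in zip(inp, restored)]


def detect(err, threshold):
    """Mark pixels whose restoration error exceeds the threshold (assumed rule)."""
    return [[e > threshold for e in row] for row in err]


# Training side: corrupt an undamaged road image with one square mask.
clean = [[0.8] * 4 for _ in range(4)]
corrupted = corrupt(clean, [(1, 1, 2)])

# Test side: a cracked input and a stand-in for the generator's output.
cracked = [row[:] for row in clean]
cracked[1][2] = 0.2            # dark crack pixel
restored = [row[:] for row in clean]  # the generator is assumed to paint intact road
mask = detect(error_map(cracked, restored), threshold=0.3)
print(mask)
```

Since the generator only ever learns to restore undamaged road, crack pixels are reconstructed poorly and stand out in the error map.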
Haiyang Peng, Yi Zhan, Benkang Wang, Hongtao Zhang
In high-definition (HD) maps, lane elements constitute the majority of components and demand stringent localization requirements to ensure safe vehicle navigation. Vision lane detection with LiDAR position assignment is a prevalent method to acquire initial lanes for HD maps. However, due to incorrect vision detection and coarse camera-LiDAR calibration, initial lanes may deviate from their true positions within an uncertain range. To mitigate the need for manual lane correction, we propose a patch-wise lane correction network (PLCNet) to automatically correct the positions of initial lane points in local LiDAR images that are transformed from point clouds. PLCNet first extracts multi-scale image features and crops patch (ROI) features centered at each initial lane point. After ROIAlign is applied, the fixed-size ROI features are flattened into 1D features. Then, a 1D lane attention module is devised to compute instance-level lane features with adaptive weights. Finally, lane correction offsets are inferred by a multi-layer perceptron and used to correct the initial lane positions. Considering practical applications, our automatic method supports merging local corrected lanes into global corrected lanes. Through extensive experiments on a self-built dataset, we demonstrate that PLCNet achieves fast and effective initial lane correction.
Xinliang Frederick Zhang, Carter Blum, Temma Choji, Shalin Shah, Alakananda Vempala
Structural extraction of events within discourse is critical since it avails
a deeper understanding of communication patterns and behavior trends. Event
argument extraction (EAE), at the core of event-centric understanding, is the
task of identifying role-specific text spans (i.e., arguments) for a given
event. Document-level EAE (DocEAE) focuses on arguments that are scattered
across an entire document. In this work, we explore open-source Large Language
Models (LLMs) for DocEAE, and propose ULTRA, a hierarchical framework that
extracts event arguments more cost-effectively. Further, it alleviates the
positional bias issue intrinsic to LLMs. ULTRA sequentially reads text chunks
of a document to generate a candidate argument set, upon which non-pertinent
candidates are dropped through self-refinement. We introduce LEAFER to address
the challenge LLMs face in locating the exact boundary of an argument. ULTRA
outperforms strong baselines, including strong supervised models and ChatGPT,
by 9.8% when evaluated by Exact Match (EM).
Authors' comments: ACL'24 Findings
Jiabin Lin, Shana Moothedath
We present conservative distributed multi-task learning in stochastic linear contextual bandits with heterogeneous agents. This extends conservative linear bandits to a distributed setting where M agents tackle different but related tasks while adhering to stage-wise performance constraints. The exact context is unknown, and only a context distribution is available to the agents, as in many practical applications that involve a prediction mechanism to infer context, such as stock market prediction and weather forecasting. We propose a distributed upper confidence bound (UCB) algorithm, DiSC-UCB. Our algorithm constructs a pruned action set during each round to ensure the constraints are met. Additionally, it includes synchronized sharing of estimates among agents via a central server using well-structured synchronization steps. We prove regret and communication bounds for the algorithm. We extend the problem to a setting where the agents are unaware of the baseline reward. For this setting, we provide a modified algorithm, DiSC-UCB2, and we show that the modified algorithm achieves the same regret and communication bounds. We empirically validate the performance of our algorithm on synthetic data and real-world Movielens-100K data.
Navin Ranjan, Andreas Savakis
Vision transformers (ViTs) have demonstrated remarkable performance across various visual tasks. However, ViT models suffer from substantial computational and memory requirements, making it challenging to deploy them on resource-constrained platforms. Quantization is a popular approach for reducing model size, but most studies mainly focus on equal bit-width quantization for the entire network, resulting in sub-optimal solutions. While there are few works on mixed precision quantization (MPQ) for ViTs, they typically rely on search space-based methods or employ mixed precision arbitrarily. In this paper, we introduce LRP-QViT, an explainability-based method for assigning mixed-precision bit allocations to different layers based on their importance during classification. Specifically, to measure the contribution score of each layer in predicting the target class, we employ the Layer-wise Relevance Propagation (LRP) method. LRP assigns local relevance at the output layer and propagates it through all layers, distributing the relevance until it reaches the input layers. These relevance scores serve as indicators for computing the layer contribution score. Additionally, we have introduced a clipped channel-wise quantization aimed at eliminating outliers from post-LayerNorm activations to alleviate severe inter-channel variations. To validate and assess our approach, we employ LRP-QViT across ViT, DeiT, and Swin transformer models on various datasets. Our experimental findings demonstrate that both our fixed-bit and mixed-bit post-training quantization methods surpass existing models in the context of 4-bit and 6-bit quantization.
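The idea of clipped channel-wise quantization can be illustrated with the sketch below, where each channel gets its own clipping range and uniform quantization grid. This is only a sketch under stated assumptions: the abstract does not say how clipping ranges are chosen, so dropping a fixed number of extreme values per channel is a stand-in, and the names `clip_range` and `quantize_channel` are hypothetical.

```python
def clip_range(values, drop):
    """Clipping range that ignores the `drop` smallest and largest values (assumed rule)."""
    ordered = sorted(values)
    return ordered[drop], ordered[-1 - drop]


def quantize_channel(values, bits, clip_lo, clip_hi):
    """Clip one channel to [clip_lo, clip_hi], then uniformly quantize and dequantize."""
    if clip_hi == clip_lo:
        return [clip_lo] * len(values)
    levels = 2 ** bits - 1
    scale = (clip_hi - clip_lo) / levels
    out = []
    for v in values:
        v = min(max(v, clip_lo), clip_hi)      # outliers saturate at the range ends
        q = round((v - clip_lo) / scale)       # integer code in [0, levels]
        out.append(clip_lo + q * scale)
    return out


# Toy activations (made up): channel 0 has outliers, channel 1 has a large offset.
channels = [
    [-10.0, 0.0, 0.5, 1.0, 1.5, 50.0],
    [100.0, 101.0, 102.0, 103.0],
]
drops = [1, 0]
quantized = []
for values, drop in zip(channels, drops):
    lo, hi = clip_range(values, drop)
    quantized.append(quantize_channel(values, bits=2, clip_lo=lo, clip_hi=hi))
print(quantized)
```

With a single shared range, the outliers at -10 and 50 and the offset of channel 1 would consume almost all of the four available levels; per-channel clipped ranges keep the grid where most values lie.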
Pawan Kumar, Prateek Dwivedi, Sobiya Ashraf, Dipin Pillai, Rahul Mangal
Self-propelled droplets serve as ideal model systems to delve deeper into
understanding the motion of biological micro-swimmers by simulating their
motility. Biological microorganisms are renowned for showcasing a diverse array
of dynamic swimming behaviors when confronted with physical constraints. This
study aims to elucidate the impact of physical constraints on swimming
characteristics of biological microorganisms. To achieve this, we present
observations on the individual and pair-wise behavior of micellar solubilized
self-propelled 4-Cyano-4'-pentyl-biphenyl (5CB) oil droplets in a square
capillary channel filled with a surfactant trimethyl ammonium bromide (TTAB)
aqueous solution. To explore the effect of the underlying P\'eclet ($Pe$)
number of the swimming droplets, the study is also performed in the presence of
additives such as high molecular weight polymer Polyethylene oxide (PEO) and
molecular solute glycerol. The capillary confinement restricts the droplets to
predominantly one-dimensional (1D) motion, albeit with noticeable differences
in their motion across the three scenarios. Through a characterization of the
chemical and hydrodynamic flow fields surrounding the droplets, we illustrate
that the modification of the droplets' chemical field due to confinement varies
significantly based on the underlying differences in the P\'eclet number ($Pe$)
in these cases. This alteration in the chemical field distribution notably
affects the individual droplets' motion. Moreover, these distinct chemical
field interactions between the droplets also lead to variations in their
pair-wise motion, ranging from behaviors like chasing to scattering.
Authors' comments: 13 pages, 9 figures
Haonan Yu, Wei Xu
Unsupervised video object learning seeks to decompose video scenes into
structural object representations without any supervision from depth, optical
flow, or segmentation. We present VONet, an innovative approach that is
inspired by MONet. While utilizing a U-Net architecture, VONet employs an
efficient and effective parallel attention inference process, generating
attention masks for all slots simultaneously. Additionally, to enhance the
temporal consistency of each mask across consecutive video frames, VONet
develops an object-wise sequential VAE framework. The integration of these
innovative encoder-side techniques, in conjunction with an expressive
transformer-based decoder, establishes VONet as the leading unsupervised method
for object learning across five MOVI datasets, encompassing videos of diverse
complexities. Code is available at https://github.com/hnyu/vonet.
Authors' comments: ICLR 2024
Daniel Campbell
We present three novel classifications of the weak sequential (and strong) limits in $W^{1,p}$ of planar diffeomorphisms. We introduce a concept called the QM condition which is a kind of separation property for pre-images of closed connected sets and show that $u$ satisfies this property exactly when it is the limit of Sobolev homeomorphisms. Further, we prove that $u\in W^{1,p}_{\operatorname{id}}((-1,1)^2,\mathbb{R}^2)$ is the limit of a sequence of homeomorphisms exactly when there are classically monotone mappings $g_{\delta}:[-1,1]^2\to \mathbb{R}^2$ and very small open sets $U_{\delta}$ such that $g_{\delta} = u$ on $[-1,1]^2 \setminus U_{\delta}$. Also, we introduce the so-called three curve condition, which is in some sense reminiscent of the NCL condition of \cite{CPR} but for $u^{-1}$ instead of for $u$, and prove that a map is the $W^{1,p}$ limit of planar Sobolev homeomorphisms exactly when it satisfies this property. This improves on results in \cite{DPP} answering the question from \cite{IO2}.
Yadong Guan, Jiqing Han, Hongwei Song, Wenjie Song, Guibin Zheng, Tieran Zheng, Yongjun He
Overlapping sound events are ubiquitous in real-world environments, but
existing end-to-end sound event detection (SED) methods still struggle to
detect them effectively. A critical reason is that these methods represent
overlapping events using shared and entangled frame-wise features, which
degrades the feature discrimination. To solve the problem, we propose a
disentangled feature learning framework to learn a category-specific
representation. Specifically, we employ different projectors to learn the
frame-wise features for each category. To ensure that these features do not
contain information of other categories, we maximize the common information
between frame-wise features within the same category and propose a frame-wise
contrastive loss. In addition, considering that the labeled data used by the
proposed method is limited, we propose a semi-supervised frame-wise contrastive
loss that can leverage large amounts of unlabeled data to achieve feature
disentanglement. The experimental results demonstrate the effectiveness of our
method.
Authors' comments: Accepted by ICASSP 2024