Daxin Li, Yuanchao Bai, Kai Wang, Junjun Jiang, Xianming Liu, Wen Gao
Transformer-based entropy models have gained prominence in recent years due
to their superior ability to capture long-range dependencies in probability
distribution estimation compared to convolution-based methods. However,
previous transformer-based entropy models suffer from a sluggish coding process
due to pixel-wise autoregression or duplicated computation during inference. In
this paper, we propose a novel transformer-based entropy model called
GroupedMixer, which enjoys both faster coding speed and better compression
performance than previous transformer-based methods. Specifically, our approach
builds upon group-wise autoregression by first partitioning the latent
variables into groups along spatial-channel dimensions, and then entropy coding
the groups with the proposed transformer-based entropy model. The global causal
self-attention is decomposed into more efficient group-wise interactions,
implemented using inner-group and cross-group token-mixers. The inner-group
token-mixer incorporates contextual elements within a group while the
cross-group token-mixer interacts with previously decoded groups. Alternating
the two token-mixers enables global contextual reference. To further
expedite the network inference, we introduce context cache optimization to
GroupedMixer, which caches attention activation values in cross-group
token-mixers and avoids complex and duplicated computation. Experimental
results demonstrate that the proposed GroupedMixer yields state-of-the-art
rate-distortion performance with fast compression speed.
Authors' comments: Accepted by IEEE TCSVT
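The group-wise partitioning described in the GroupedMixer abstract can be illustrated with a minimal sketch. The abstract only states that latents are split into groups along spatial-channel dimensions; the specific layout below (contiguous channel slices combined with a 2x2 spatial pattern), the decoding order, and the names `group_index` and `partition` are assumptions made for illustration, not the authors' scheme.

```python
def group_index(c, h, w, channels_per_slice, spatial_stride=2):
    """Map a latent position (channel c, row h, column w) to a group id.

    Assumed layout: channels are cut into contiguous slices, and each slice
    is split into spatial_stride**2 interleaved spatial sub-grids.
    """
    channel_slice = c // channels_per_slice
    spatial_pos = (h % spatial_stride) * spatial_stride + (w % spatial_stride)
    return channel_slice * spatial_stride * spatial_stride + spatial_pos


def partition(num_channels, height, width, channels_per_slice, spatial_stride=2):
    """Return a list of groups; each group is a list of (c, h, w) positions.

    In a group-wise autoregressive model, group g would be coded
    conditioned on groups 0..g-1 (the order here is hypothetical).
    """
    groups = {}
    for c in range(num_channels):
        for h in range(height):
            for w in range(width):
                g = group_index(c, h, w, channels_per_slice, spatial_stride)
                groups.setdefault(g, []).append((c, h, w))
    return [groups[g] for g in sorted(groups)]


if __name__ == "__main__":
    for g, members in enumerate(partition(4, 2, 2, channels_per_slice=2)):
        print(g, members)
```

With 4 channels, a 2x2 latent, and 2 channels per slice, this yields 8 groups of 2 elements each, so coding takes 8 autoregressive steps instead of 16 element-wise ones.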
Jinming Cao, Sicheng Shen, Qiu Zhou, Yifang Yin, Yangyan Li, Roger Zimmermann
Photographing optoelectronic displays often introduces unwanted moiré
patterns due to analog signal interference between the pixel grids of the
display and the camera sensor arrays. This work identifies two problems that
are largely ignored by existing image demoiréing approaches: 1) moiré
patterns vary across different channels (RGB); 2) repetitive patterns are
constantly observed. However, employing conventional convolutional (CNN) layers
cannot address these problems. Instead, this paper presents the use of our
recently proposed \emph{Shape} concept. It was originally employed to model
consistent features from fragmented regions, particularly when identical or
similar objects coexist in an RGB-D image. Interestingly, we find that the
Shape information effectively captures the moiré patterns in artifact images.
Motivated by this discovery, we propose a new method, ShapeMoiré, for image
demoiréing. Beyond modeling shape features at the patch-level, we further
extend this to the global image-level and design a novel Shape-Architecture.
Consequently, our proposed method, equipped with both ShapeConv and
Shape-Architecture, can be seamlessly integrated into existing approaches
without introducing any additional parameters or computation overhead during
inference. We conduct extensive experiments on four widely used datasets, and
the results demonstrate that our ShapeMoiré achieves state-of-the-art
performance, particularly in terms of the PSNR metric.
Authors' comments: 19 pages
Shi-Yu Xia, Wenxuan Zhu, Xu Yang, Xin Geng
In practice, we usually need to build variable-sized models adapted to diverse resource constraints in different application scenarios, where weight initialization is an important step prior to training. The recently introduced Learngene framework first learns one compact part, termed learngene, from a large well-trained model, after which the learngene is expanded to initialize variable-sized models. In this paper, we start by analysing the importance of guidance for the expansion of well-trained learngene layers, which inspires the design of a simple but highly effective Learngene approach termed SWS (Stage-wise Weight Sharing), where both learngene layers and their learning process critically contribute to providing knowledge and guidance for initializing models at varying scales. Specifically, to learn learngene layers, we build an auxiliary model comprising multiple stages where the layer weights in each stage are shared, after which we train it through distillation. Subsequently, we expand these learngene layers containing stage information at their corresponding stage to initialize models of variable depths. Extensive experiments on ImageNet-1K demonstrate that SWS achieves consistently better performance compared to many models trained from scratch, while reducing total training costs by around 6.6x. In some cases, SWS performs better after only 1 epoch of tuning. When initializing variable-sized models adapted to different resource constraints, SWS achieves better results while reducing the parameters stored to initialize these models by around 20x and pre-training costs by around 10x, in contrast to the pre-training and fine-tuning approach.
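The expansion step in the SWS abstract (stage-shared learngene layers repeated at their corresponding stage to reach a target depth) can be sketched as follows. This is a minimal illustration under assumptions: layers are represented as plain dictionaries, and the function name `expand_learngene` and the per-stage layer counts are hypothetical rather than taken from the paper.

```python
import copy


def expand_learngene(stage_layers, layers_per_stage):
    """Build an initial layer stack for a model of the requested depth.

    stage_layers:     one shared layer (here a dict of weights) per stage.
    layers_per_stage: how many layers the target model has in each stage.
    Each target layer starts as an independent copy of its stage's shared
    layer, so later fine-tuning can let the copies diverge.
    """
    if len(stage_layers) != len(layers_per_stage):
        raise ValueError("need exactly one layer count per stage")
    model = []
    for shared, count in zip(stage_layers, layers_per_stage):
        model.extend(copy.deepcopy(shared) for _ in range(count))
    return model


if __name__ == "__main__":
    learngene = [{"w": [1.0]}, {"w": [2.0]}]    # two stages, toy weights
    small = expand_learngene(learngene, [1, 1])  # depth 2
    large = expand_learngene(learngene, [2, 3])  # depth 5
    print(len(small), len(large))
```

The same stored learngene initializes both depths, which is the source of the storage saving the abstract reports for variable-sized models.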
Hao Miao, Senzhang Wang, Meiyue Zhang, Diansheng Guo, Funing Sun, Fan Yang
Accurately forecasting traffic flows is critically important to many real
applications including public safety and intelligent transportation systems.
The challenges of this problem include both the dynamic mobility patterns of
the people and the complex spatial-temporal correlations of the urban traffic
data. Meanwhile, most existing models ignore the diverse impacts of the various
traffic observations (e.g. vehicle speed and road occupancy) on the traffic
flow prediction, and different traffic observations can be considered as
different channels of input features. We argue that the analysis in
multiple-channel traffic observations might help to better address this
problem. In this paper, we study the novel problem of multi-channel traffic
flow prediction, and propose a deep Multi-View Channel-wise Spatio-Temporal
Network (MVC-STNet) model to effectively address it. Specifically,
we first construct the localized and globalized spatial graph where the
multi-view fusion module is used to effectively extract the local and global
spatial dependencies. Then LSTM is used to learn the temporal correlations. To
effectively model the different impacts of various traffic observations on
traffic flow prediction, a channel-wise graph convolutional network is also
designed. Extensive experiments are conducted over the PEMS04 and PEMS08
datasets. The results demonstrate that the proposed MVC-STNet outperforms
state-of-the-art methods by a large margin.
Authors' comments: Accepted by AAAI2020 workshop
Federico Marocco, J. Davy Kirkpatrick, Adam C. Schneider, Aaron M. Meisner, Mark Popinchalk, Christopher R. Gelino, Jacqueline K. Faherty, Adam J. Burgasser et al.
We present the discovery of 13 new widely separated T dwarf companions to M
dwarf primaries, identified using WISE/NEOWISE data by the CatWISE and Backyard
Worlds: Planet 9 projects. This sample represents a $\sim$60% increase in the
number of known M+T systems, and allows us to probe the most extreme products
of binary/planetary system formation, a discovery space made available by the
CatWISE2020 catalog and the Backyard Worlds: Planet 9 effort. Highlights among
the sample are WISEP J075108.79-763449.6, a previously known T9 thought to be
old due to its SED, which we now find is part of a common-proper-motion pair
with L 34-26 A, a well studied young M3 V star within 10 pc of the Sun; CWISE
J054129.32-745021.5 B and 2MASS J05581644-4501559 B, two T8 dwarfs possibly
associated with the very fast-rotating M4 V stars CWISE J054129.32-745021.5 A
and 2MASS J05581644-4501559 A; and UCAC3 52-1038 B, which is among the widest
late T companions to main sequence stars, with a projected separation of
$\sim$7100 au. The new benchmarks presented here are prime $JWST$ targets, and
can help us place strong constraints on formation and evolution theory of
substellar objects as well as on atmospheric models for these cold exoplanet
analogs.
Authors' comments: Accepted for publication in ApJ. 35 pages, 6 tables, 21 figures
Paulo Yanez Sarmiento, Simon Witzke, Nadja Klein, Bernhard Y. Renard
Explainability is a key component in many applications involving deep neural
networks (DNNs). However, current explanation methods for DNNs commonly leave
it to the human observer to distinguish relevant explanations from spurious
noise. This is no longer feasible when going from easily human-accessible
data such as images to more complex data such as genome sequences. To
facilitate the accessibility of DNN outputs from such complex data and to
increase explainability, we present a modification of the widely used
explanation method layer-wise relevance propagation. Our approach enforces
sparsity directly by pruning the relevance propagation for the different
layers. Thereby, we achieve sparser relevance attributions for the input
features as well as for the intermediate layers. As the relevance propagation
is input-specific, we aim to prune the relevance propagation rather than the
underlying model architecture. This allows different neurons to be pruned for
different inputs and hence might be more appropriate to the local nature of
explanation methods. To demonstrate the efficacy of our method, we evaluate it
on two types of data, images and genomic sequences. We show that our
modification indeed leads to noise reduction and concentrates relevance on the
most important features compared to the baseline.
Authors' comments: 15 pages, 5 figures
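The idea of pruning the relevance propagation, rather than the model, can be sketched for a single linear layer. The propagation rule below is the standard epsilon variant of layer-wise relevance propagation; the pruning rule (keep the `keep` largest-magnitude relevances and rescale so the total is conserved) is an assumption for illustration, since the abstract does not specify the criterion. Function names are hypothetical.

```python
def lrp_linear(activations, weights, relevance_out, eps=1e-9):
    """Epsilon-LRP through one linear layer.

    weights[i][j] connects input i to output j. Returns one relevance
    value per input neuron.
    """
    n_in, n_out = len(activations), len(relevance_out)
    z = [sum(activations[i] * weights[i][j] for i in range(n_in))
         for j in range(n_out)]
    relevance_in = []
    for i in range(n_in):
        r = 0.0
        for j in range(n_out):
            # Stabilise the denominator away from zero, keeping its sign.
            denom = z[j] + (eps if z[j] >= 0 else -eps)
            r += activations[i] * weights[i][j] / denom * relevance_out[j]
        relevance_in.append(r)
    return relevance_in


def prune_relevance(relevance, keep):
    """Zero all but the `keep` largest-magnitude entries (assumed rule),
    then rescale so the summed relevance is unchanged."""
    order = sorted(range(len(relevance)),
                   key=lambda i: abs(relevance[i]), reverse=True)
    kept = set(order[:keep])
    pruned = [r if i in kept else 0.0 for i, r in enumerate(relevance)]
    total, kept_total = sum(relevance), sum(pruned)
    if kept_total != 0:
        pruned = [r * total / kept_total for r in pruned]
    return pruned


if __name__ == "__main__":
    rel = lrp_linear([1.0, 2.0, 0.5], [[1.0], [1.0], [1.0]], [1.0])
    print(prune_relevance(rel, keep=2))
```

Because the pruning acts on the relevance values of one input, a different input can keep a different set of neurons, which matches the input-specific behaviour the abstract describes.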
Davide Materia, Leonardo Ratini, Celestino Angeli, Leonardo Guidoni
The intersection of Quantum Chemistry and Quantum Computing has led to significant advancements in understanding the potential of using quantum devices for the efficient calculation of molecular energies. Simultaneously, this intersection is enhancing the comprehension of quantum chemical properties through the use of quantum computing and quantum information tools. This paper tackles a key question in this relationship: Is the nature of the orbital-wise electron correlations in wavefunctions of realistic prototypical cases classical or quantum? We delve into this inquiry with a comprehensive examination of molecular wavefunctions using Shannon and von Neumann entropies, alongside classical and quantum information theory. Our analysis reveals a notable distinction between classical and quantum mutual information in molecular systems when analyzed with Hartree-Fock canonical orbitals. However, this difference decreases dramatically, by approximately 100-fold, when Natural Orbitals are used as reference. This finding suggests that wavefunction correlations, when viewed through the appropriate orbital basis, are predominantly classical. This insight indicates that computational tasks in quantum chemistry could be significantly simplified by employing Natural Orbitals. Consequently, our study underscores the importance of using Natural Orbitals to accurately assess molecular wavefunction correlations and to avoid their overestimation. In summary, our results suggest a promising path for computational simplification in quantum chemistry, advocating for the wider adoption of Natural Orbitals and raising questions about the actual computational complexity of the multi-body problem in quantum chemistry.
Wenqi Jia, Sian Jin, Jinzhen Wang, Wei Niu, Dingwen Tao, Miao Yin
The rapid expansion of computational capabilities and the ever-growing scale of modern HPC systems present formidable challenges in managing exascale scientific data. Faced with such vast datasets, traditional lossless compression techniques prove insufficient in reducing data size to a manageable level while preserving all information intact. In response, researchers have turned to error-bounded lossy compression methods, which offer a balance between data size reduction and information retention. However, despite their utility, these compressors employing conventional techniques struggle with limited reconstruction quality. To address this issue, we draw inspiration from recent advancements in deep learning and propose GWLZ, a novel group-wise learning-based lossy compression framework with multiple lightweight learnable enhancer models. Leveraging a group of neural networks, GWLZ significantly enhances the decompressed data reconstruction quality with negligible impact on the compression efficiency. Experimental results on different fields from the Nyx dataset demonstrate remarkable improvements by GWLZ, achieving up to 20% quality enhancements with negligible overhead as low as 0.0003x.
Yuyan Shi, Jialu Ma, Jin Yang, Shasha Wang, Yichi Zhang
Medical image segmentation plays an important role in many image-guided clinical approaches. However, existing segmentation algorithms mostly rely on the availability of fully annotated images with pixel-wise annotations for training, which can be both labor-intensive and expertise-demanding, especially in the medical imaging domain where only experts can provide reliable and accurate annotations. To alleviate this challenge, there has been a growing focus on developing segmentation methods that can train deep models with weak annotations, such as image-level labels, bounding boxes, scribbles, and points. The emergence of vision foundation models, notably the Segment Anything Model (SAM), has introduced innovative capabilities for segmentation tasks using weak annotations for promptable segmentation enabled by large-scale pre-training. Adopting foundation models together with traditional learning methods has gained increasing interest in the research community and shown potential for real-world applications. In this paper, we present a comprehensive survey of recent progress on annotation-efficient learning for medical image segmentation utilizing weak annotations before and in the era of foundation models. Furthermore, we analyze and discuss several challenges of existing approaches, which we believe will provide valuable guidance for shaping the trajectory of foundation models to further advance the field of medical image segmentation.
Fang Guo, Wenyu Li, Honglei Zhuang, Yun Luo, Yafu Li, Le Yan, Qi Zhu, Yue Zhang
The most recent pointwise Large Language Model (LLM) rankers have achieved remarkable ranking results. However, these rankers are hindered by two major drawbacks: (1) they fail to follow a standardized comparison guidance during the ranking process, and (2) they struggle with comprehensive considerations when dealing with complicated passages. To address these shortcomings, we propose to build a ranker that generates ranking scores based on a set of criteria from various perspectives. These criteria are intended to direct each perspective in providing a distinct yet synergistic evaluation. Our research, which examines eight datasets from the BEIR benchmark, demonstrates that incorporating this multi-perspective criteria ensemble approach markedly enhances the performance of pointwise LLM rankers.
Changsuk Oh, Dongseok Shim, Taekbeom Lee, H. Jin Kim
Object removal refers to the process of erasing designated objects from an image while preserving the overall appearance, and it is one area where image inpainting is widely used in real-world applications. The performance of an object remover is quantitatively evaluated by measuring the quality of object removal results, similar to how the performance of an image inpainter is gauged. Current works reporting quantitative performance evaluations utilize original images as references. In this letter, to show that the current evaluation methods cannot properly evaluate the performance of an object remover, we create a dataset with object removal ground truth and compare the evaluations made by the current methods using original images to those utilizing object removal ground truth images. The disparities between the two evaluation sets confirm that the current methods are not suitable for measuring the performance of an object remover. Additionally, we propose new evaluation methods tailored to gauge the performance of an object remover. The proposed methods evaluate the performance through class-wise object removal results and utilize images without the target class objects as a comparison set. We confirm that the proposed methods can make judgments consistent with human evaluators on the COCO dataset, and that they can produce measurements aligning with those using object removal ground truth on the self-acquired dataset.
Junbiao Pang, Zailin Dong, Jiaxin Deng, Mengyuan Zhu, Yunwei Zhang
Parsing Computer-Aided Design (CAD) drawings is a fundamental step for CAD
revision, semantic-based management, and the generation of 3D prototypes in
both the architecture and engineering industries. Labeling symbols from a CAD
drawing is a notoriously challenging task from a practical point of view. In
this work, we propose to label and spot symbols from CAD images that are
converted from CAD drawings. The advantage of spotting symbols from CAD images
lies in the low demands on labelers and the low annotation cost. However,
spotting symbols pixel-wise in CAD images is challenging. We propose a
pixel-wise point location method via Progressive Gaussian Kernels (PGK) to balance
training efficiency and location accuracy. Besides, we introduce a
local offset to the heatmap-based point location method. Based on the keypoint
detection, we propose a symbol grouping method to redraw the rectangle symbols
in CAD images. We have released a dataset containing CAD images of equipment
rooms from telecommunication industrial CAD drawings. Extensive experiments on
this real-world dataset show that the proposed method has good generalization
ability.
Authors' comments: 10 pages, 10 figures, 6 tables
Yu Li, Han Jiang, Chuanyang Gong, Zhihua Wei
Despite the remarkable achievements of language models (LMs) across a broad spectrum of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current solutions involving finetuning or auxiliary models usually require extensive computational resources, hindering their practicality in large language models (LLMs). In this paper, we propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs. Specifically, we derive detoxification vectors from self-induced, universal steering pairs through arithmetic operations in activation spaces. During inference, detoxification is achieved by fusing the detoxification vectors with the original representations in a head-wise manner. Empirical results demonstrate that our method significantly outperforms previous state-of-the-art approaches on various metrics, while also maintaining satisfactory generation quality and diversity. We further validate the practicality and scalability of DeStein with a series of white-box LLMs. The method is open-sourced at https://github.com/LizLizLi/DeStein. Warning: Some example model outputs may contain highly offensive or disturbing text.
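The two steps named in the DeStein abstract (deriving a detoxification vector by arithmetic on activations of steering pairs, then fusing it head-wise at inference) can be sketched as below. The specific arithmetic (mean of non-toxic minus toxic activations), the additive fusion with a per-head strength `alpha`, and all function names are assumptions for illustration; the abstract does not spell these out.

```python
def detox_vector(nontoxic_acts, toxic_acts):
    """Mean difference between paired activations (assumed arithmetic).

    nontoxic_acts[p] and toxic_acts[p] are the activations of the two
    members of steering pair p, taken at the same layer and head.
    """
    n, dim = len(nontoxic_acts), len(nontoxic_acts[0])
    return [
        sum(nontoxic_acts[p][d] - toxic_acts[p][d] for p in range(n)) / n
        for d in range(dim)
    ]


def fuse_headwise(head_acts, head_vectors, alphas):
    """Shift each head's activation along its own detox vector.

    alphas[h] is a hypothetical per-head strength; 0.0 leaves a head
    untouched.
    """
    return [
        [h + a * v for h, v in zip(act, vec)]
        for act, vec, a in zip(head_acts, head_vectors, alphas)
    ]


if __name__ == "__main__":
    v = detox_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 1.0]])
    print(fuse_headwise([[0.0, 0.0], [1.0, 1.0]], [v, v], [1.0, 0.0]))
```

No gradient step or auxiliary model appears in the sketch, which is consistent with the low resource cost the abstract claims for activation-space methods.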
Raoul Prisant, Federica Garin, Paolo Frasca
In this paper we make use of graphon theory to study opinion dynamics on
large undirected networks. The opinion dynamics models that we take into
consideration allow for negative interactions between the individuals, i.e.
competing entities whose opinions can grow apart. We consider both the
repelling model and the opposing model that are studied in the literature. We
define the repelling and the opposing dynamics on graphons and we show that
the solutions of their initial value problems exist and are unique. We then show that
the graphon dynamics closely approximate the dynamics on large graphs that
converge to a graphon. This result applies to large random graphs that are
sampled according to a graphon. All these facts are illustrated in an extended
numerical example.
Authors' comments: 8 double-column pages. This revised version corrects several typos.
An abridged version is going to appear in the proceedings of the 2024 IEEE
Conference on Decision and Control
Jiing-Ping Wang, Ming-Guang Lin, An-Yeu Wu
With the rise of Transformer models in the NLP and CV domains, Multi-Head Attention (MHA) has proven to be a game-changer. However, its expensive computation poses challenges to model throughput and efficiency, especially for long-sequence tasks. Exploiting the sparsity in attention has proven to be an effective way to reduce computation. Nevertheless, prior works do not consider the various distributions among different heads and lack a systematic method to determine the threshold. To address these challenges, we propose Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer (LATTE). LATTE employs a head-wise threshold-based filter with a low-precision dot product and a computation reuse mechanism to reduce the computation of MHA. Moreover, the trainable threshold is introduced to provide a systematic method for adjusting the thresholds and to enable end-to-end optimization. Experimental results indicate that LATTE can smoothly adapt to both NLP and CV tasks, offering significant computation savings with only a minor compromise in performance. Also, the trainable threshold is shown to be essential for the trade-off between performance and computation. As a result, LATTE filters up to 85.16% of keys with only a 0.87% accuracy drop in the CV task and 89.91% of keys with a 0.86 perplexity increase in the NLP task.
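The filtering idea in the LATTE abstract (a cheap low-precision score decides which keys survive a per-head threshold, and only survivors enter the full-precision attention) can be sketched for one query in one head. The quantisation scheme, the fallback when no key passes, and the fixed threshold value are assumptions for illustration; in the paper the threshold is trained, and the computation reuse mechanism is not modelled here.

```python
import math


def quantize(vec, scale=4):
    """Hypothetical low-precision step: round onto a coarse integer grid."""
    return [round(x * scale) for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def filtered_attention(q, keys, values, threshold, scale=4):
    """Attention output for one query, using only keys whose approximate
    score reaches this head's threshold. Returns (output, kept_indices)."""
    q_lp = quantize(q, scale)
    kept = [
        i for i, k in enumerate(keys)
        if dot(q_lp, quantize(k, scale)) / (scale * scale) >= threshold
    ]
    if not kept:
        # Assumed fallback: keep the single best key so softmax is defined.
        kept = [max(range(len(keys)), key=lambda i: dot(q, keys[i]))]
    scores = [dot(q, keys[i]) / math.sqrt(len(q)) for i in kept]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    out = [
        sum(e / total * values[i][d] for e, i in zip(exps, kept))
        for d in range(len(values[0]))
    ]
    return out, kept


if __name__ == "__main__":
    keys = [[1.0, 0.0], [-1.0, 0.0], [0.9, 0.1]]
    values = [[1.0], [100.0], [3.0]]
    print(filtered_attention([1.0, 0.0], keys, values, threshold=0.5))
```

In the example, the key pointing away from the query is dropped before any exponentials are computed, so its large value never influences the output.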
Nooshin Yousefzadeh, Rahul Sengupta, Yashaswi Karnati, Anand Rangarajan, Sanjay Ranka
Traffic congestion has significant economic, environmental, and social
ramifications. Intersection traffic flow dynamics are influenced by numerous
factors. While microscopic traffic simulators are valuable tools, they are
computationally intensive and challenging to calibrate. Moreover, existing
machine-learning approaches struggle to provide lane-specific waveforms or
adapt to intersection topology and traffic patterns. In this study, we propose
two efficient and accurate "Digital Twin" models for intersections, leveraging
Graph Attention Neural Networks (GAT). These attentional graph auto-encoder
digital twins capture temporal, spatial, and contextual aspects of traffic
within intersections, incorporating various influential factors such as
high-resolution loop detector waveforms, signal state records, driving
behaviors, and turning-movement counts. Trained on diverse counterfactual
scenarios across multiple intersections, our models generalize well, enabling
the estimation of detailed traffic waveforms for any intersection approach and
exit lanes. Multi-scale error metrics demonstrate that our models perform
comparably to microsimulations. The primary application of our study lies in
traffic signal optimization, a pivotal area in transportation systems research.
These lightweight digital twins can seamlessly integrate into corridor and
network signal timing optimization frameworks. Furthermore, our study's
applications extend to lane reconfiguration, driving behavior analysis, and
facilitating informed decisions regarding intersection safety and efficiency
enhancements. A promising avenue for future research involves extending this
approach to urban freeway corridors and integrating it with measures of
effectiveness metrics.
Authors' comments: T-TIS Journal, 12 pages, 8 figures, 4 tables
Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dastgheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Evaluating Large Language Models (LLMs) is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, resulting in the absence or weakness of LLMs for many languages. In response to this necessity, we introduce Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Distinctive features of the Khayyam Challenge are (i) its comprehensive coverage of various topics, including literary comprehension, mathematics, sciences, logic, intelligence testing, etc., aimed at assessing different facets of LLMs such as language comprehension, reasoning, and information retrieval across various educational stages, from lower primary school to upper secondary school; (ii) its inclusion of rich metadata such as human response rates, difficulty levels, and descriptive answers; (iii) its utilization of new data to avoid data contamination issues prevalent in existing frameworks; (iv) its use of original, non-translated data tailored for Persian speakers, ensuring the framework is free from translation challenges and errors while encompassing cultural nuances; and (v) its inherent scalability for future data updates and evaluations without requiring special human effort. Previous works lacked an evaluation framework that combined all of these features into a single comprehensive benchmark. Furthermore, we evaluate a wide range of existing LLMs that support the Persian language, with statistical analyses and interpretations of their outputs.
Zihao Wang, Bin Cui, Shaoduo Gan
Optimizing the Key-Value (KV) cache of Large Language Models (LLMs) is considered critical to reducing the cost of inference. Most existing KV-cache compression algorithms attempt to sparsify the sequence of tokens by taking advantage of the different importance of tokens. However, most of these methods treat all layers equally, allocating the same KV budget to each layer. This approach is suboptimal, as some layers may be less sensitive to input tokens yet still receive the same budget as others. In this work, we find that by identifying the importance of attention layers, we can optimize the KV-cache jointly along two dimensions, i.e., sequence-wise and layer-wise. Based on our observations regarding layer-wise importance in inference, we propose SqueezeAttention to precisely optimize the allocation of KV-cache budget among layers on-the-fly and then incorporate three representative sequence-wise algorithms to compress the KV-cache for each layer with its own budget. Specifically, we first measure each layer's importance by calculating the cosine similarity of the input prompt differences before and after the self-attention layers. Based on this similarity, we then categorize the layers into two groups and adjust their KV budgets accordingly. By optimizing the KV-cache along both the sequence and layer dimensions, SqueezeAttention achieves memory reductions of around 30% to 70% and throughput improvements of up to 2.2 times across a wide range of LLMs and benchmarks. The code is available at https://github.com/hetailang/SqueezeAttention.
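The layer-wise budgeting described in the SqueezeAttention abstract can be sketched in a few lines: score each layer by the cosine similarity between its representation before and after self-attention, split the layers into two groups, and move KV budget from one group to the other. Several details are assumptions for illustration only: that high similarity marks a less important layer, that the split point is the mean similarity, that the shrink factor is 0.5, and the function names themselves. The released repository should be consulted for the actual procedure.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def allocate_budgets(similarities, base_budget, shrink=0.5):
    """Return one KV budget per layer; the total budget is preserved.

    similarities[i] compares layer i's hidden state before and after
    self-attention. Assumption: a layer that barely changes its input
    (similarity above the mean) can make do with a smaller cache.
    """
    threshold = sum(similarities) / len(similarities)
    less_important = [i for i, s in enumerate(similarities) if s > threshold]
    important = [i for i, s in enumerate(similarities) if s <= threshold]
    budgets = [float(base_budget)] * len(similarities)
    if not less_important or not important:
        return budgets  # nothing to trade between groups
    freed = 0.0
    for i in less_important:
        budgets[i] = base_budget * shrink
        freed += base_budget - budgets[i]
    for i in important:
        budgets[i] += freed / len(important)
    return budgets


if __name__ == "__main__":
    sims = [0.2, 0.9, 0.95, 0.3]  # toy per-layer similarities
    print(allocate_budgets(sims, base_budget=100))
```

Each layer's budget would then be handed to a sequence-wise compression algorithm of choice, which is the second dimension the abstract refers to.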
Mo Kordzanganeh, Danial Keshvary, Nariman Arian
Latent diffusion models are the state of the art for synthetic image
generation. To align these models with human preferences, training the models
using reinforcement learning on human feedback is crucial. Black et al. (2024)
introduced denoising diffusion policy optimisation (DDPO), which accounts for
the iterative denoising nature of the generation by modelling it as a Markov
chain with a final reward. As the reward is a single value that determines the
model's performance on the entire image, the model has to navigate a very
sparse reward landscape and so requires a large sample count. In this work, we
extend the DDPO by presenting the Pixel-wise Policy Optimisation (PXPO)
algorithm, which can take feedback for each pixel, providing a more nuanced
reward to the model.
Authors' comments: 6 pages, 7 figures
Toshiyuki Mizuki, Munetake Momose, Masataka Aizawa, Hiroshi Kobayashi
More than a thousand warm debris disks have been detected as infrared excess
at mid-infrared wavelengths, and their frequencies have been obtained for
various spectral types of stars. However, the dependence of the frequencies on
spectral type is still debated because the number of stars with significant and
detectable infrared excess is limited. Herein, we present the largest
systematic search for infrared excess using data from Gaia, WISE, and Spitzer.
We identified 373, 485, and 255 reliable infrared excesses in the mid-infrared
archival data at wavelengths of 12, 22, and 24 $\mu$m for WISE/$W3$, $W4$, and
Spitzer/MIPS ch1, respectively. Although we confirmed that more massive stars
tend to show higher frequencies of debris disks, these disk frequencies are
relatively flat for both low- and intermediate-mass stars, with a jump at 7000
K for all three wavelengths. Assuming that bright, warm debris disks have
lifetimes of a few to several hundred million years, the disk frequency can be
understood as the ratio between the timescale and the upper limits of the
sample ages. We also found that intermediate-mass stars with infrared excess
tend to be bluer and fainter along the evolutionary track than those without,
implying that massive stars hosting debris disks are relatively young, with an
isochronal age of approximately 500 Myr. These tendencies are reasonably
explained by a standard scenario in which debris disks are likely to be
produced by collisions of planetesimals in early stages of stellar evolution,
such as the Late Heavy Bombardment.
Authors' comments: Accepted for publication in AJ. 27 pages, 19 figures, 5 tables
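The ratio argument in the debris-disk abstract can be written compactly. The symbols are introduced here for illustration and are not taken from the paper: $f_{\rm disk}$ is the observed disk frequency, $\tau_{\rm disk}$ the lifetime of a bright, warm debris disk, and $t_{\rm max}$ the upper limit of the sample ages, with stellar ages assumed to be spread roughly uniformly up to $t_{\rm max}$.

$$ f_{\rm disk} \approx \frac{\tau_{\rm disk}}{t_{\rm max}}, \qquad \tau_{\rm disk} \le t_{\rm max} $$

As a purely hypothetical example, $\tau_{\rm disk} = 300$ Myr and $t_{\rm max} = 3$ Gyr would give $f_{\rm disk} \approx 0.1$; these numbers are not results from the paper.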