Zhihao Shuai, Boyan Li, Siyu Yan, Yuyu Luo, Weikai Yang
Although data visualization is powerful for revealing patterns and communicating insights, creating effective visualizations requires familiarity with authoring tools and often disrupts the analysis flow. While large language models show promise for automatically converting analysis intent into visualizations, existing methods function as black boxes without transparent reasoning processes, which prevents users from understanding design rationales and refining suboptimal outputs. To bridge this gap, we propose integrating Chain-of-Thought (CoT) reasoning into the Natural Language to Visualization (NL2VIS) pipeline. First, we design a comprehensive CoT reasoning process for NL2VIS and develop an automatic pipeline to equip existing datasets with structured reasoning steps. Second, we introduce nvBench-CoT, a specialized dataset capturing detailed step-by-step reasoning from ambiguous natural language descriptions to finalized visualizations, which enables state-of-the-art performance when used for model fine-tuning. Third, we develop DeepVIS, an interactive visual interface that tightly integrates with the CoT reasoning process, allowing users to inspect reasoning steps, identify errors, and make targeted adjustments to improve visualization outcomes. Quantitative benchmark evaluations, two use cases, and a user study collectively demonstrate that our CoT framework effectively enhances NL2VIS quality while providing insightful reasoning steps to users.
Zihan Fang, Zhiyong Xu, Lan Du, Shide Du, Zhiling Cai, Shiping Wang
Existing multi-view learning models struggle in open-set scenarios due to their implicit assumption of class completeness. Moreover, static view-induced biases, which arise from spurious view-label associations formed during training, further degrade their ability to recognize unknown categories. In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. To simulate ambiguous samples, we design O-Mix, a novel synthesis strategy to generate virtual samples with calibrated open-set ambiguity uncertainty. These samples are further processed by an auxiliary ambiguity perception network that captures atypical patterns for improved open-set adaptation. Furthermore, we incorporate an HSIC-based contrastive debiasing module that enforces independence between view-specific ambiguous and view-consistent representations, encouraging the model to learn generalizable features. Extensive experiments on diverse multi-view benchmarks demonstrate that the proposed framework consistently enhances unknown-class recognition while preserving strong closed-set performance.
Marc Hölle, Walter Kellermann, Vasileios Belagiannis
Semantic segmentation models trained on known object classes often fail in
real-world autonomous driving scenarios by confidently misclassifying unknown
objects. While pixel-wise out-of-distribution detection can identify unknown
objects, existing methods struggle in complex scenes where rare object classes
are often confused with truly unknown objects. We introduce an
uncertainty-aware likelihood ratio estimation method that addresses these
limitations. Our approach uses an evidential classifier within a likelihood
ratio test to distinguish between known and unknown pixel features from a
semantic segmentation model, while explicitly accounting for uncertainty.
Instead of producing point estimates, our method outputs probability
distributions that capture uncertainty from both rare training examples and
imperfect synthetic outliers. We show that by incorporating uncertainty in this
way, outlier exposure can be leveraged more effectively. Evaluated on five
standard benchmark datasets, our method achieves the lowest average false
positive rate (2.5%) among state-of-the-art methods while maintaining high
average precision (90.91%) and incurring only negligible computational overhead. Code
is available at https://github.com/glasbruch/ULRE.
Authors' comments: Accepted at ICCVW 2025, 11 pages, 4 figures
Yuan-Cheng Yu, Yen-Chieh Ouyang, Chun-An Lin
Time-series anomaly detection plays a central role across a wide range of
application domains. With the increasing proliferation of the Internet of
Things (IoT) and smart manufacturing, time-series data has dramatically
increased in both scale and dimensionality. This growth has exposed the
limitations of traditional statistical methods in handling the high
heterogeneity and complexity of such data. Inspired by the recent success of
large language models (LLMs) in multimodal tasks across language and vision
domains, we propose a novel unsupervised anomaly detection framework: A
Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly
Detection (TriP-LLM). TriP-LLM integrates local and global temporal features
through a tri-branch design (Patching, Selection, and Global) to encode the input
time series into patch-wise tokens, which are then processed by a frozen,
pretrained LLM. A lightweight patch-wise decoder reconstructs the input, from
which anomaly scores are derived. We evaluate TriP-LLM on several public
benchmark datasets using PATE, a recently proposed threshold-free evaluation
metric, and conduct all comparisons within a unified open-source framework to
ensure fairness. Experimental results show that TriP-LLM consistently
outperforms recent state-of-the-art methods across all datasets, demonstrating
strong detection capabilities. Furthermore, through extensive ablation studies,
we verify the substantial contribution of the LLM to the overall architecture.
Compared to LLM-based approaches using Channel Independence (CI) patch
processing, TriP-LLM achieves significantly lower memory consumption, making it
more suitable for GPU memory-constrained environments. All code and model
checkpoints are publicly available at https://github.com/YYZStart/TriP-LLM.git
Authors' comments: 11 pages, 2 figures
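The patch-wise tokenization that TriP-LLM builds on can be illustrated with a minimal sketch. It assumes a univariate series, fixed-length patches, and a fixed stride; the function name and parameter values are hypothetical and not taken from the paper, and the Selection and Global branches are not modeled.

```python
from typing import List, Sequence


def make_patches(series: Sequence[float], patch_len: int, stride: int) -> List[List[float]]:
    """Split a 1D series into (possibly overlapping) fixed-length patches."""
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    # Only full patches are kept; a trailing partial window is dropped.
    return [list(series[i:i + patch_len]) for i in range(0, last_start + 1, stride)]


# Illustrative input: seven samples, patches of length 3 with stride 2.
example_series = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
example_patches = make_patches(example_series, patch_len=3, stride=2)
print(example_patches)
```

Each patch would then be embedded as one token before being passed to the language model; that embedding step is outside the scope of this sketch.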
Zhihui Guo, Xin Man, Hui Xu, Jie Shao
Multimodal Large Language Models (MLLMs) excel in vision-language tasks such as image captioning but remain prone to object hallucinations, where they describe objects that do not appear in the image. To mitigate this, we propose \textbf{LISA}, a \textbf{L}ayer-wise \textbf{I}ntegration and \textbf{S}uppression \textbf{A}pproach that enhances generation consistency through hierarchical modulation and multi-layer fusion. LISA leverages the functional hierarchy within MLLMs, where shallow layers provide visual grounding, middle layers encode semantics, and deep layers tend to amplify spurious signals. First, zone-specific spectral modulation stabilizes attention by suppressing over-amplified activations in deeper layers while preserving alignment cues in earlier layers. Second, token-level logits from selected layers are fused via anchor-based routing, with token-wise anchor selection and soft logit fusion enabling adaptive integration during decoding. LISA is fully \textbf{plug-and-play} and can be seamlessly integrated into existing MLLMs, including Qwen2.5-VL. Experiments on multiple benchmarks show that LISA reduces hallucinations by up to 53.6\% in $\mathrm{CHAIR}_I$ and improves POPE F1 by 4.5\%, demonstrating strong generalization across models and tasks.
Guangteng Liu, Xiayue Liu, Zhixiang Xu, Yufeng Yuan, Hui Zhao, Yuxuan Liu, Yufei Jiang
Wi-Fi sensing offers a promising technique for contactless human respiration
monitoring. A key challenge, however, is the blind spot problem caused by
random phase offsets that corrupt the complementarity of respiratory signals.
To address this challenge, we propose a single-antenna Wi-Fi sensing
(SA-WiSense) framework that improves the accuracy of human respiration
monitoring and is robust against random phase offsets. The proposed SA-WiSense
framework is cost-efficient, as it uses only a single antenna rather than the
multiple antennas used in previous works. The framework is therefore applicable
to the Internet of Things (IoT), where most sensors are equipped with a single
antenna. On the one hand, we propose a cross-subcarrier channel state information
(CSI) ratio (CSCR) based blind spot mitigation approach for IoT, where the
ratios of two values of CSI between subcarriers are leveraged to mitigate
random phase offsets. We prove that the random phase offsets can be cancelled
by the proposed CSCR approach, thereby restoring the inherent complementarity
of signals for blind-spot-free sensing. On the other hand, we propose a genetic
algorithm (GA) based subcarrier selection (GASS) approach by formulating an
optimization problem in terms of the sensing-signal-to-noise ratio (SSNR) of
CSCR between subcarriers. GA is utilized to solve the formulated optimization
problem. We use commodity ESP32 microcontrollers to build an experimental
testbed. The proposed approaches are validated to achieve a detection rate of
91.2% for respiration monitoring at distances up to 8.0 meters, substantially
more accurate than the state-of-the-art methods with a single antenna.
Authors' comments: 12 pages, 10 figures
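The phase-offset cancellation claimed for the CSCR approach can be checked numerically. The sketch below assumes a simplified model in which both subcarriers of a packet share the same random phase offset; the channel values and the function name are hypothetical and not taken from the paper.

```python
import cmath
import random


def csi_ratio(h_a: complex, h_b: complex) -> complex:
    """Ratio of the CSI values measured on two subcarriers of the same packet."""
    return h_a / h_b


# Hypothetical true channel responses on two subcarriers.
h1 = complex(0.8, 0.3)
h2 = complex(0.5, -0.4)
reference = h1 / h2

random.seed(0)
ratios = []
for _ in range(5):
    # Assumption: one random phase offset per packet, common to both subcarriers.
    offset = cmath.exp(1j * random.uniform(-cmath.pi, cmath.pi))
    ratios.append(csi_ratio(h1 * offset, h2 * offset))

# The common factor cancels in the ratio, so every packet yields the same value.
max_error = max(abs(r - reference) for r in ratios)
print(max_error < 1e-12)
```

Under this assumption the ratio depends only on the two channel responses, which is why the offset no longer disturbs the respiratory signal.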
Minnan Pei, Gang Li, Junwen Si, Zeyu Zhu, Zitao Mo, Peisong Wang, Zhuoran Song, Xiaoyao Liang et al.
3D Gaussian Splatting (3DGS) has emerged as a leading neural rendering technique for high-fidelity view synthesis, prompting the development of dedicated 3DGS accelerators for mobile applications. Through in-depth analysis, we identify two major limitations in the conventional decoupled preprocessing-rendering dataflow adopted by existing accelerators: 1) a significant portion of preprocessed Gaussians are not used in rendering, and 2) the same Gaussian gets repeatedly loaded across different tile renderings, resulting in substantial computational and data movement overhead. To address these issues, we propose GCC, a novel accelerator designed for fast and energy-efficient 3DGS inference. At the dataflow level, GCC introduces: 1) cross-stage conditional processing, which interleaves preprocessing and rendering to dynamically skip unnecessary Gaussian preprocessing; and 2) Gaussian-wise rendering, ensuring that all rendering operations for a given Gaussian are completed before moving to the next, thereby eliminating duplicated Gaussian loading. We also propose an alpha-based boundary identification method to derive compact and accurate Gaussian regions, thereby reducing rendering costs. We implement our GCC accelerator in 28nm technology. Extensive experiments demonstrate that GCC significantly outperforms the state-of-the-art 3DGS inference accelerator, GSCore, in both performance and energy efficiency.
Enze Zhou, Wenjian Li, Wenting Xu, Yuwei Lu, Shangbin Chen, Shaoyang Wang, Gang Zheng, Tianwu Xie et al.
Photon-counting computed tomography (PCCT) has demonstrated significant
advancements in recent years; however, pixel-wise detector response
nonuniformity remains a key challenge, frequently manifesting as ring artifacts
in reconstructed images. Existing correction methods exhibit limited
generalizability in complex multi-material scenarios, such as contrast-enhanced
imaging. This study introduces a Signal-to-Uniformity Error Polynomial
Calibration (STEPC) framework to address this issue. STEPC first fits
multi-energy projections using a 2D polynomial surface to generate ideal
references, then applies a nonlinear multi-energy polynomial model to predict
and correct pixel-wise nonuniformity errors. The model is calibrated using
homogeneous slab phantoms of different materials, including PMMA, aluminum, and
iodinated contrast agents, enabling correction for both non-contrast and
contrast-enhanced imaging. Experiments were performed on a custom Micro-PCCT
system with phantoms and a mouse. The correction performance of STEPC was evaluated
using the mean local standard deviation (MLSD) in the projection domain and the
ring artifact deviation (RAD) on the reconstructed images. STEPC consistently
outperformed existing correction methods in both non-contrast and
contrast-enhanced scenarios. It achieved the lowest MLSD and RAD for both
phantoms and mouse scans. These results indicate that STEPC provides a robust
and practical solution for correcting detector nonuniformity in multi-material
PCCT imaging, which positions it as a promising general-purpose calibration
framework for photon-counting CT systems.
Authors' comments: 10 pages, 12 figures. Submitted to IEEE Transactions on Medical
Imaging
Judy Long, Tao Liu, Sean Alexander Woznicki, Miljana Marković, Oskar Marko, Molly Sears
Crop mapping involves identifying and classifying crop types using spatial
data, primarily derived from remote sensing imagery. This study presents the
first comprehensive review of large-scale, pixel-wise crop mapping workflows,
encompassing both conventional supervised methods and emerging transfer
learning approaches. To identify the optimal supervised crop mapping workflows,
we conducted systematic experiments, comparing six widely adopted satellite
image-based preprocessing methods, alongside eleven supervised pixel-wise
classification models. Additionally, we assessed the synergistic impact of
varied training sample sizes and variable combinations. Moreover, we identified
optimal transfer learning techniques for different magnitudes of domain shift.
The best methods were evaluated across five diverse agricultural sites.
Landsat 8 served as the primary satellite data source, and labels came from
CDL trusted pixels and field surveys.
Our findings reveal three key insights. First, fine-scale interval
preprocessing paired with Transformer models consistently delivered optimal
performance for both supervised and transferable workflows. RF offered rapid
training and competitive performance in conventional supervised learning and
direct transfer to similar domains. Second, transfer learning techniques
enhanced workflow adaptability, with UDA being effective for homogeneous crop
classes while fine-tuning remains robust across diverse scenarios. Finally,
workflow choice depends heavily on the availability of labeled samples. With a
sufficient sample size, supervised training typically delivers more accurate
and generalizable results. Below a certain threshold, transfer learning that
matches the level of domain shift is a viable alternative to achieve crop
mapping. Repository:
Best-Practices-for-Large-Scale-Pixel-Wise-Crop-Mapping-and-Transfer-Learning-Workflows
Authors' comments: A review article. 41 pages, 22 figures. Preprint
Malavika Vasist, Paul Mollière, Helena Kühnle, Olivier Absil, Gilles Louppe, Rens Waters, Manuel Güdel, Thomas Henning et al.
Cold brown dwarf atmospheres are good training grounds for analyzing temperate giant planets. WISEP J173835.52+273258.9 (WISE 1738) is an isolated Y0 brown dwarf with a temperature between 350 and 400 K, at the T-Y transition. While its near-infrared spectrum has been studied, bulk properties and chemistry remain uncertain. We analyze new JWST MIRI medium-resolution spectra (5-18 micron), combined with near-infrared spectra (0.98-2.2 micron) from HST/WFC3 and Gemini/GNIRS, to better constrain WISE 1738's atmosphere and physical parameters. We use Neural Posterior Estimation (NPE) with a cloud-free petitRADTRANS model and evaluate results using posterior checks, coverage, and L-C2ST diagnostics. Our retrieval confirms previous constraints on H2O, CH4, and NH3, and for the first time constrains CO, CO2, and 15NH3. We find evidence of disequilibrium chemistry through CO and CO2 abundances not expected under equilibrium. The estimated properties are temperature 402 (+12,-9) K, log g 4.43 (+0.26,-0.34) cm/s2, mass 13 (+11,-7) M_Jup, radius 1.14 (+0.03,-0.03) R_Jup, and bolometric luminosity -6.52 (+0.05,-0.04) log L/L_sun. Evolutionary models suggest an age between 1 and 4 Gyr, consistent with a 6-hour rotation. We place an upper bound on 15NH3, implying a 3-sigma lower limit on the 14N/15N ratio of 275. We also derive a C/O ratio of 1.35 (+0.39,-0.31) and metallicity of 0.34 (+0.12,-0.11), without accounting for oxygen sequestration.
Osama Hardan, Omar Elshenhabi, Tamer Khattab, Mohamed Mabrok
Vision Mamba models promise transformer-level performance at linear computational cost, but their reliance on serializing 2D images into 1D sequences introduces a critical, yet overlooked, design choice: the patch scan order. In medical imaging, where modalities like brain MRI contain strong anatomical priors, this choice is non-trivial. This paper presents the first systematic study of how scan order impacts MRI segmentation. We introduce Multi-Scan 2D (MS2D), a parameter-free module for Mamba-based architectures that facilitates exploring diverse scan paths without additional computational cost. We conduct a large-scale benchmark of 21 scan strategies on three public datasets (BraTS 2020, ISLES 2022, LGG), covering over 70,000 slices. Our analysis shows conclusively that scan order is a statistically significant factor (Friedman test: $\chi^{2}_{20}=43.9, p=0.0016$), with performance varying by as much as 27 Dice points. Spatially contiguous paths -- simple horizontal and vertical rasters -- consistently outperform disjointed diagonal scans. We conclude that scan order is a powerful, cost-free hyperparameter, and provide an evidence-based shortlist of optimal paths to maximize the performance of Mamba models in medical imaging.
Authors' comments: Submitted to the 2025 IEEE International Conference on Future Machine Learning and Data Science (FMLDS)
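To make the notion of a patch scan order concrete, the sketch below serializes a small patch grid along the two spatially contiguous paths named in the abstract, horizontal and vertical rasters. It is an illustrative sketch only and not the MS2D module; the function names and the row-major patch indexing are assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def horizontal_raster(rows: int, cols: int) -> List[int]:
    """Row-by-row scan: indices of a row-major patch grid in reading order."""
    return [r * cols + c for r in range(rows) for c in range(cols)]


def vertical_raster(rows: int, cols: int) -> List[int]:
    """Column-by-column scan over the same row-major patch grid."""
    return [r * cols + c for c in range(cols) for r in range(rows)]


def serialize(patches: Sequence[T], order: Sequence[int]) -> List[T]:
    """Reorder a flat, row-major list of patches into a 1D scan sequence."""
    return [patches[i] for i in order]


# Illustrative 2 x 3 grid of patch labels stored in row-major order.
grid_patches = ["a", "b", "c", "d", "e", "f"]
horizontal_sequence = serialize(grid_patches, horizontal_raster(2, 3))
vertical_sequence = serialize(grid_patches, vertical_raster(2, 3))
print(horizontal_sequence, vertical_sequence)
```

Both orders are permutations of the same patches, so switching between them changes only the sequence seen by the model, not the amount of computation.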
Casey Wall, Longwei Wang, Rodrigue Rizk, KC Santosh
Interpreting the decision-making process of Convolutional Neural Networks
(CNNs) is critical for deploying models in high-stakes domains.
Gradient-weighted Class Activation Mapping (Grad-CAM) is a widely used method
for visual explanations, yet it typically focuses on the final convolutional
layer or naïvely averages across layers, strategies that can obscure
important semantic cues or amplify irrelevant noise. We propose Winsor-CAM, a
novel, human-tunable extension of Grad-CAM that generates robust and coherent
saliency maps by aggregating information across all convolutional layers. To
mitigate the influence of noisy or extreme attribution values, Winsor-CAM
applies Winsorization, a percentile-based outlier attenuation technique. A
user-controllable threshold allows for semantic-level tuning, enabling flexible
exploration of model behavior across representational hierarchies. Evaluations
on standard architectures (ResNet50, DenseNet121, VGG16, InceptionV3) using the
PASCAL VOC 2012 dataset demonstrate that Winsor-CAM produces more interpretable
heatmaps and achieves superior performance in localization metrics, including
intersection-over-union and center-of-mass alignment, when compared to Grad-CAM
and uniform layer-averaging baselines. Winsor-CAM advances the goal of
trustworthy AI by offering interpretable, multi-layer insights with
human-in-the-loop control.
Authors' comments: 15 pages, 10 figures, 7 tables. Submitted to IEEE Transactions on
Pattern Analysis and Machine Intelligence
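Winsorization itself is a generic statistical operation, and a minimal sketch helps show how a percentile threshold attenuates outliers. The version below is one-sided and uses a nearest-rank percentile, both of which are assumptions; the abstract does not specify the exact variant used by Winsor-CAM, and the layer importance values are made up for illustration.

```python
from typing import List, Sequence


def winsorize_upper(values: Sequence[float], percentile: int) -> List[float]:
    """Clip every value above the given percentile (1-100) to that percentile's value."""
    if not values:
        return []
    if not 1 <= percentile <= 100:
        raise ValueError("percentile must be between 1 and 100")
    ordered = sorted(values)
    # Nearest-rank percentile, computed with integer ceiling division.
    rank = max(1, -(-percentile * len(ordered) // 100))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]


# Hypothetical per-layer importance scores with one extreme value.
layer_importance = [0.2, 0.5, 0.3, 0.4, 9.0]
clipped_importance = winsorize_upper(layer_importance, percentile=80)
print(clipped_importance)
```

Lowering the percentile clips more of the largest scores, which is the kind of user-controllable threshold the abstract describes.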
Zihe Yan, Zhuosheng Zhang
Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, where malicious visual elements divert model attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose \textbf{LaSM}, a \textit{Layer-wise Scaling Mechanism} that selectively amplifies attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant regions without additional training. Extensive experiments across 12 types of pop-up perturbations and 4 different model backbones show that LaSM consistently enhances the defense success rate. When combined with prompt-level alerts, LaSM achieves over 98\% robustness even under strong inductive attacks. Our findings reveal that attention misalignment is a core vulnerability in MLLM agents and can be effectively addressed through selective layer-wise modulation.
Authors' comments: 10 pages, 9 figures
Modan Tailleur, Mathieu Lagrange, Pierre Aumond, Vincent Tourre
Environmental sound recordings often contain intelligible speech, raising privacy concerns that limit analysis, sharing and reuse of data. In this paper, we introduce a method that renders speech unintelligible while preserving both the integrity of the acoustic scene and the overall audio quality. Our approach involves reversing waveform segments to distort speech content. This process is enhanced through a voice activity detection and speech separation pipeline, which allows for more precise targeting of speech. To demonstrate the effectiveness of the proposed approach, we consider a three-part evaluation protocol that assesses: 1) speech intelligibility using Word Error Rate (WER), 2) sound source detectability using Sound source Classification Accuracy-Drop (SCAD) from a widely used pre-trained model, and 3) audio quality using the Fréchet Audio Distance (FAD), computed with our reference dataset that contains unaltered speech. Experiments on this simulated evaluation dataset, which consists of linear mixtures of speech and environmental sound scenes, show that our method achieves satisfactory speech intelligibility reduction (97.9% WER), minimal degradation of sound source detectability (2.7% SCAD), and high perceptual quality (FAD of 1.40). An ablation study further highlights the contribution of each component of the pipeline. We also show that incorporating random splicing into our speech content privacy enforcement method can enhance the algorithm's robustness to attempts to recover the clean speech, at a slight cost in audio quality.
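The core operation, reversing waveform segments, can be sketched in a few lines. This is a simplified illustration that reverses fixed-length segments over the whole signal; the segment length and function name are hypothetical, and the voice activity detection, speech separation, and random splicing stages are not modeled.

```python
from typing import List, Sequence


def reverse_segments(samples: Sequence[float], segment_len: int) -> List[float]:
    """Reverse the sample order inside each consecutive fixed-length segment."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    out: List[float] = []
    for start in range(0, len(samples), segment_len):
        # The final segment may be shorter than segment_len; it is reversed as well.
        out.extend(reversed(samples[start:start + segment_len]))
    return out


# Illustrative waveform of seven samples, processed in segments of three.
waveform = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
scrambled = reverse_segments(waveform, segment_len=3)
print(scrambled)
```

The output keeps the same samples and length as the input, so the long-term energy envelope is retained while the local time structure is flipped.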
A. Bochkov
The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
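The merging step described above, averaging the output logits of specialist models, can be sketched as follows. The sketch assumes that all experts share one vocabulary and are weighted equally; the logit values and function names are hypothetical and do not come from the released code.

```python
from typing import List, Sequence


def average_logits(logits_per_expert: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the next-token logits produced by several experts."""
    if not logits_per_expert:
        raise ValueError("at least one expert is required")
    n_experts = len(logits_per_expert)
    # zip(*...) groups the logits of all experts for each vocabulary entry.
    return [sum(column) / n_experts for column in zip(*logits_per_expert)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits from two experts over a three-token vocabulary.
expert_a = [2.0, 0.0, 1.0]
expert_b = [0.0, 4.0, 1.0]
merged_logits = average_logits([expert_a, expert_b])
predicted_token = argmax(merged_logits)
print(merged_logits, predicted_token)
```

Because the experts share the same fixed input embeddings and vocabulary, their logits are directly comparable, which is what makes a plain average meaningful here.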
Woonsang Kang, Joohyung Lee, Seungjun Kim, Jungchan Cho, Yoonseon Oh
Grasp pose detection (GPD) is a fundamental capability for robotic autonomy,
but its reliance on large, diverse datasets creates significant data privacy
and centralization challenges. Federated Learning (FL) offers a
privacy-preserving solution, but its application to GPD is hindered by the
substantial communication overhead of large models, a key issue for
resource-constrained robots. To address this, we propose a novel module-wise FL
framework that begins by analyzing the learning dynamics of the GPD model's
functional components. This analysis identifies slower-converging modules, to
which our framework then allocates additional communication effort. This is
realized through a two-phase process: a standard full-model training phase is
followed by a communication-efficient phase where only the identified subset of
slower-converging modules is trained and their partial updates are aggregated.
Extensive experiments on the GraspNet-1B dataset demonstrate that our method
outperforms standard FedAvg and other baselines, achieving higher accuracy for
a given communication budget. Furthermore, real-world experiments on a physical
robot validate our approach, showing a superior grasp success rate compared to
baseline methods in cluttered scenes. Our work presents a
communication-efficient framework for training robust, generalized GPD models
in a decentralized manner, effectively improving the trade-off between
communication cost and model performance.
Authors' comments: 8 pages, 5 figures. Submitted to IEEE Robotics and Automation Letters
(RA-L)
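The communication-efficient phase, in which only a subset of modules is aggregated, can be illustrated with a minimal sketch. It assumes equal client weights and flat parameter lists per module; the module names and values are hypothetical, and the procedure for identifying slower-converging modules is not modeled.

```python
from typing import Dict, List, Sequence

Model = Dict[str, List[float]]


def aggregate_modules(global_model: Model, client_models: Sequence[Model],
                      module_names: Sequence[str]) -> Model:
    """Average only the selected modules across clients; keep the rest unchanged."""
    if not client_models:
        raise ValueError("at least one client model is required")
    updated = {name: list(params) for name, params in global_model.items()}
    for name in module_names:
        # Group the parameters of this module position by position across clients.
        columns = zip(*(client[name] for client in client_models))
        updated[name] = [sum(column) / len(client_models) for column in columns]
    return updated


# Hypothetical two-module model; only "head" is treated as slower-converging.
server_model = {"backbone": [1.0, 1.0], "head": [0.0, 0.0]}
clients = [
    {"backbone": [5.0, 5.0], "head": [1.0, 3.0]},
    {"backbone": [7.0, 7.0], "head": [3.0, 5.0]},
]
new_server_model = aggregate_modules(server_model, clients, module_names=["head"])
print(new_server_model)
```

In this phase only the parameters of the selected modules need to be transmitted, which is where the communication saving would come from.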
Jiawei Sun, Hongkang Li, Meng Wang
Jumping connections enable Graph Convolutional Networks (GCNs) to overcome
over-smoothing, while graph sparsification reduces computational demands by
selecting a sub-matrix of the graph adjacency matrix during neighborhood
aggregation. Learning GCNs with graph sparsification has shown empirical
success across various applications, but a theoretical understanding of the
generalization guarantees remains limited, with existing analyses ignoring
either graph sparsification or jumping connections. This paper presents the
first learning dynamics and generalization analysis of GCNs with jumping
connections using graph sparsification. Our analysis demonstrates that the
generalization accuracy of the learned model closely approximates the highest
achievable accuracy within a broad class of target functions dependent on the
proposed sparse effective adjacency matrix $A^*$. Thus, graph sparsification
maintains generalization performance when $A^*$ preserves the essential edges
that support meaningful message propagation. We reveal that jumping connections
lead to different sparsification requirements across layers. In a
two-hidden-layer GCN, the generalization is more affected by the sparsified
matrix deviations from $A^*$ of the first layer than the second layer. To the
best of our knowledge, this marks the first theoretical characterization of
jumping connections' role in sparsification requirements. We validate our
theoretical results on benchmark datasets in deep GCNs.
Authors' comments: TMLR
Yan Dong, Enci Xu, Shaoqiang Qiu, Wenxuan Li, Yang Liu, Bin Han
High-speed ground robots moving on unstructured terrains generate intense
high-frequency vibrations, leading to LiDAR scan distortions in LiDAR-inertial
odometry (LIO). Accurate and efficient undistortion is extremely challenging
due to (1) rapid and non-smooth state changes during intense vibrations and (2)
unpredictable IMU noise coupled with a limited IMU sampling frequency. To
address this issue, this paper introduces post-undistortion uncertainty. First,
we model the undistortion errors caused by linear and angular vibrations and
assign post-undistortion uncertainty to each point. We then leverage this
uncertainty to guide point-to-map matching, compute uncertainty-aware
residuals, and update the odometry states using an iterated Kalman filter. We
conduct vibration-platform and mobile-platform experiments on multiple public
datasets as well as our own recordings, demonstrating that our method achieves
better performance than other methods when the LiDAR undergoes intense vibration.
Authors' comments: 8 pages, 10 figures, 5 tables. Accepted by Robotics and Automation
Letters on June 30
Lorenzo Giaretto, Nicola Soave
In this paper we establish existence and properties of minimal energy solutions for the weakly coupled system $$ \begin{cases}
-\Delta u_i + λ_i u_i = μ_i|u_i|^{Kq-2}u_i + β|u_i|^{q-2}u_i\prod_{j\neq i}|u_j|^q & \text{in }\mathbb{R}^d, \qquad
u_i \in H^1(\mathbb{R}^d), \end{cases}\qquad i=1,\dots, K, $$ characterized by $K$-wise interaction (namely the interaction term involves the product of all the components). We consider both attractive ($β>0$) and repulsive cases ($β<0$), and we give sufficient conditions on $β$ in order to have least energy fully non-trivial solutions, if necessary under a radial constraint. We also study the asymptotic behavior of least energy fully non-trivial radial solutions in the limit of strong competition $β\to -\infty$, showing partial segregation phenomena which differ substantially from those arising in pairwise interaction models.
Authors' comments: 26 pages, no figures
Taehoon Kim, Jongwook Choi, Yonghyun Jeong, Haeun Noh, Jaejun Yoo, Seungryul Baek, Jongwon Choi
We introduce a deepfake video detection approach that exploits pixel-wise
temporal inconsistencies, which traditional spatial frequency-based detectors
often overlook. Traditional detectors represent temporal information merely by
stacking spatial frequency spectra across frames, resulting in the failure to
detect temporal artifacts in the pixel plane. Our approach performs a 1D
Fourier transform on the time axis for each pixel, extracting features highly
sensitive to temporal inconsistencies, especially in areas prone to unnatural
movements. To precisely locate regions containing the temporal artifacts, we
introduce an attention proposal module trained in an end-to-end manner.
Additionally, our joint transformer module effectively integrates pixel-wise
temporal frequency features with spatio-temporal context features, expanding
the range of detectable forgery artifacts. Our framework represents a
significant advancement in deepfake video detection, providing robust
performance across diverse and challenging detection scenarios.
Authors' comments: Accepted by ICCV 2025. Code will be available at
https://github.com/rama0126/PwTF-DVD
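The per-pixel 1D Fourier transform along the time axis can be illustrated with a small, dependency-free sketch. It uses a direct DFT on a nested-list video indexed as video[t][y][x]; the data layout, function names, and toy values are assumptions, and the attention proposal and joint transformer modules are not modeled.

```python
import cmath
from typing import List, Sequence


def temporal_spectrum(series: Sequence[float]) -> List[float]:
    """Magnitudes of the 1D DFT of one pixel's intensity over time."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series)))
        for k in range(n)
    ]


def pixelwise_temporal_spectra(video: Sequence[Sequence[Sequence[float]]]) -> List[List[List[float]]]:
    """Map video[t][y][x] to spectra[y][x][k] by transforming each pixel's time series."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    return [
        [temporal_spectrum([video[t][y][x] for t in range(frames)]) for x in range(width)]
        for y in range(height)
    ]


# Toy video: 4 frames of 1 x 2 pixels. Pixel 0 is constant, pixel 1 flickers every frame.
toy_video = [[[1.0, 1.0]], [[1.0, -1.0]], [[1.0, 1.0]], [[1.0, -1.0]]]
toy_spectra = pixelwise_temporal_spectra(toy_video)
print([round(v, 6) for v in toy_spectra[0][1]])
```

The constant pixel concentrates its energy in the zero-frequency bin, while the flickering pixel concentrates it in the highest-frequency bin, which is the kind of temporal signature such features are meant to expose.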