Shinyu Kim, Minjin Kim, Suyeon Son, Luis C. Ho
We assess the systematics and efficiency of an AGN selection method based on mid-infrared (MIR) variability. To this end, we utilize various types of active and inactive galaxies from the Sloan Digital Sky Survey, matching them with multi-epoch photometric data from the NEOWISE mission. Using W1 and W2 band light curves with a $\sim10$-year baseline, we find that combining the likelihood of deviation from non-variability with the correlation coefficient between the W1 and W2 bands reliably identifies AGNs. Specifically, this MIR-based method recovers $\sim 28.2\%$ of optically selected AGNs. Applying the same technique to inactive galaxies, we identify AGN candidates at fractions ranging from $0.4$ to $11.8\%$, indicating that MIR variability allows us to detect AGN candidates even in optically inactive hosts. While some variable sources exhibit transient-like light curves, possibly originating from tidal disruption events or supernovae, their contribution to the total variable population is less than a few percent, indicating a minimal impact on our results. Across all subsamples, the AGN fraction marginally increases with star formation activity, implying coordinated evolution between central black hole growth and star formation. Finally, the AGN fraction inferred from our method drops dramatically in classical LINERs, consistent with their low accretion rates and absence of a dusty torus.
Authors' comments: 17 pages, 13 figures; Accepted for publication in ApJ
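The selection logic described in the abstract above (deviation from non-variability combined with the W1-W2 correlation) can be illustrated with a minimal sketch. It assumes a reduced chi-square against a constant-brightness model as a stand-in for the paper's likelihood statistic. The function names and the thresholds `chi2_min` and `r_min` are hypothetical and are not taken from the paper.

```python
import math

def reduced_chi2(mags, errs):
    """Reduced chi-square of a light curve against a constant (non-variable) model."""
    weights = [1.0 / e ** 2 for e in errs]
    # Inverse-variance weighted mean is the best-fit constant magnitude.
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    chi2 = sum(((m - mean) / e) ** 2 for m, e in zip(mags, errs))
    return chi2 / (len(mags) - 1)

def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def is_variable_candidate(w1, w1_err, w2, w2_err, chi2_min=3.0, r_min=0.7):
    """Flag a source when both bands vary and the two bands vary together.

    chi2_min and r_min are illustrative values, not the paper's cuts.
    """
    return (
        reduced_chi2(w1, w1_err) > chi2_min
        and reduced_chi2(w2, w2_err) > chi2_min
        and pearson_r(w1, w2) > r_min
    )

# Toy light curves (magnitudes per epoch); W2 tracks W1 with a constant offset.
w1 = [15.0, 15.2, 14.8, 15.3, 14.7]
w2 = [14.5, 14.7, 14.3, 14.8, 14.2]
err = [0.03] * 5
candidate = is_variable_candidate(w1, err, w2, err)
```

Requiring the correlation as well as the variability statistic guards against sources where only one band fluctuates, which is more likely to reflect noise or artifacts than a physical change.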
Kai Göbel, Pierrick Lorang, Patrik Zips, Tobias Glück
Task planning, the problem of sequencing actions to reach a goal from an initial state, is a core capability requirement for autonomous robotic systems. Whether large language models (LLMs) can serve as viable planners alongside classical symbolic methods remains an open question. We present PyPDDLEngine, an open-source Planning Domain Definition Language (PDDL) simulation engine that exposes planning operations as LLM tool calls through a Model Context Protocol (MCP) interface. Rather than committing to a complete action sequence upfront, the LLM acts as an interactive search policy that selects one action at a time, observes each resulting state, and can reset and retry. We evaluate four approaches on 102 International Planning Competition (IPC) Blocksworld instances under a uniform 180-second budget: Fast Downward lama-first and seq-sat-lama-2011 as classical baselines, direct LLM planning (Claude Haiku 4.5), and agentic LLM planning via PyPDDLEngine. Fast Downward achieves 85.3% success. The direct and agentic LLM approaches achieve 63.7% and 66.7%, respectively, a consistent but modest three-percentage-point advantage for the agentic approach at $5.7\times$ higher token cost per solution. Across most co-solved difficulty blocks, both LLM approaches produce shorter plans than seq-sat-lama-2011 despite its iterative quality improvement, a result consistent with training-data recall rather than generalisable planning. These results suggest that agentic gains depend on the nature of environmental feedback. Coding agents benefit from externally grounded signals such as compiler errors and test failures, whereas PDDL step feedback is self-assessed, leaving the agent to evaluate its own progress without external verification.
Peng Shurui, Xin Lin, Shi Luo, Jincen Ou, Dizhe Zhang, Lu Qi, Truong Nguyen, Chao Ren
Image restoration under diverse degradations remains challenging for unified all-in-one frameworks due to feature interference and insufficient expert specialization. We propose SLER-IR, a spherical layer-wise expert routing framework that dynamically activates specialized experts across network layers. To ensure reliable routing, we introduce a Spherical Uniform Degradation Embedding with contrastive learning, which maps degradation representations onto a hypersphere to eliminate geometry bias in linear embedding spaces. In addition, a Global-Local Granularity Fusion (GLGF) module integrates global semantics and local degradation cues to address spatially non-uniform degradations and the train-test granularity gap. Experiments on three-task and five-task benchmarks demonstrate that SLER-IR achieves consistent improvements over state-of-the-art methods in both PSNR and SSIM. Code and models will be publicly released.
Ameya Markale, Luise Brock, Ihor Horishnyi, Dominika Skwierawska, Tri-Thien Nguyen, Hannes Schreiter, Shirin Heidarikahkesh, Lorenz A. Kapsner et al.
Diffusion-weighted imaging (DWI) can support lesion detection and characterization in breast magnetic resonance imaging (MRI); however, high b-value diffusion-weighted acquisitions in particular are prone to intensity artifacts that can affect diagnostic image assessment. This study aims to detect both hyper- and hypointense artifacts on high b-value diffusion-weighted images (b=1500 s/mm²) using deep learning, employing either a binary classification (artifact presence) or a multiclass classification (artifact intensity) approach on a slice-wise dataset. This IRB-approved retrospective study used a single-center dataset comprising n=11806 slices from routine 3T breast MRI examinations performed between 2022 and mid-2023. Three convolutional neural network (CNN) architectures (DenseNet121, ResNet18, and SEResNet50) were trained for binary classification of hyper- and hypointense artifacts. The best-performing model (DenseNet121) was applied to an independent holdout test set and was further trained separately for multiclass classification. Evaluation included area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), precision, and recall, as well as analysis of predicted bounding box positions derived from the network's Grad-CAM heatmaps. DenseNet121 achieved AUROCs of 0.92 and 0.94 for hyper- and hypointense artifact detection, respectively, and weighted AUROCs of 0.85 and 0.88 for multiclass classification on single-slice high b-value diffusion-weighted images. A radiologist evaluated bounding box precision on a 1-5 Likert-like scale across 200 slices, yielding mean scores of 3.33 ± 1.04 for hyperintense artifacts and 2.62 ± 0.81 for hypointense artifacts. Detection of hyper- and hypointense artifacts in a slice-wise breast DWI dataset (b=1500 s/mm²) using CNNs, particularly DenseNet121, seems promising but requires further validation.
Mostafa Atallah, Rebekah Herrman
Variational quantum circuits for image classification suffer from barren plateaus, while quantum kernel methods scale quadratically with dataset size. We propose an iterative framework based on Quadratic Unconstrained Binary Optimization (QUBO) for training the classifier head of convolutional neural networks (CNNs) via quantum annealing, entirely avoiding gradient-based circuit optimization. Following the Extreme Learning Machine paradigm, convolutional filters are randomly initialized and frozen, and only the fully connected layer is optimized. At each iteration, a convex quadratic surrogate derived from the feature Gram matrix replaces the non-quadratic cross-entropy loss, yielding an iteration-stable curvature proxy. A per-output decomposition splits the $C$-class problem into $C$ independent QUBOs, each with $(d+1)K$ binary variables, where $d$ is the feature dimension and $K$ is the bit precision, so that problem size depends on the image resolution and bit precision, not on the number of training samples. We evaluate the method on six image-classification benchmarks (sklearn digits, MNIST, Fashion-MNIST, CIFAR-10, EMNIST, KMNIST). A precision study shows that accuracy improves monotonically with bit resolution, with 10 bits representing a practical minimum for effective optimization; the 15-bit formulation remains within the qubit and coupler limits of current D-Wave Advantage hardware. The 20-bit formulation matches or exceeds classical stochastic gradient descent on MNIST, Fashion-MNIST, and EMNIST, while remaining competitive on CIFAR-10 and KMNIST. All experiments use simulated annealing, establishing a baseline for direct deployment on quantum annealing hardware.
Authors' comments: 28 pages, 5 figures, 9 tables. Submitted to Quantum Machine Intelligence
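The problem-size claim in the abstract above, $(d+1)K$ binary variables for each of the $C$ independent QUBOs, can be checked with a few lines of arithmetic. The sketch below also shows one way $K$ bits could map to a real-valued weight. The reading of $d+1$ as $d$ feature weights plus a bias, the fixed-point encoding, the weight range, and the function names are all assumptions for illustration, not the paper's formulation.

```python
def qubo_variables_per_class(d: int, K: int) -> int:
    """Binary variables in one per-class QUBO: (d + 1) weights at K bits each.

    Interpreting the extra weight as a bias term is an assumption.
    """
    return (d + 1) * K

def decode_weight(bits, lo=-1.0, hi=1.0):
    """Map K bits to a real weight in [lo, hi) using a plain binary fraction.

    This fixed-point encoding is assumed for illustration only.
    """
    frac = sum(b * 2.0 ** -(k + 1) for k, b in enumerate(bits))
    return lo + (hi - lo) * frac

# Hypothetical feature dimension d = 64 with K = 10 bits of precision.
size = qubo_variables_per_class(64, 10)
# The count has no dependence on the number of training samples.
```

Under this encoding, each additional bit halves the spacing between representable weights, which is consistent with the abstract's observation that accuracy improves with bit resolution.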
Rahul Gupta, Rushikesh Sonawane, Shabnam Iyyani, D. Frederiks, Judith Racusin, Tanmoy Chattopadhayay, A. J. Castro-Tirado, A. F. Valeev et al.
We investigate the spectro-polarimetric properties of the long-duration GRB~220107A, which exhibited two distinct emission episodes separated by a 40 s quiescent gap, to test whether such multi-episode bursts show evidence for evolution in their underlying radiation mechanisms. We analyzed prompt emission data from AstroSat/CZTI, Fermi/GBM, and Konus-Wind, performing spectro-polarimetric analysis for each emission episode. The time-integrated polarization analysis shows no significant detection (PF$ < 38 \%$, $2\sigma$). Time-resolved analysis reveals clear spectral evolution between the two episodes, with episode 1 exhibiting a hard low-energy photon index and episode 2 showing substantial spectral softening ($\alpha \sim -0.72$). Regarding polarization, episode 1 shows a low polarization upper limit ($< 52\%$), consistent with expectations for photospheric emission dominated by quasi-thermal Comptonization in a baryon-rich outflow. Episode 2 also shows overall low polarization (PF$ < 55 \%$, $2\sigma$), though sliding-window analysis yields a marginally elevated signal (PF$= 70 \pm 30\%$, BF = 2.8) between T0+76 and T0+88 s. The robust spectral softening between episodes could arise from sub-photospheric dissipation or from optically thin synchrotron radiation in small-scale magnetic fields; if the tentative polarization enhancement proves intrinsic, it would instead favor synchrotron emission in large-scale ordered magnetic fields. The spectral evolution of GRB 220107A, combined with our polarimetric constraints, demonstrates the diagnostic potential of time-resolved spectro-polarimetry for constraining GRB prompt emission physics. We present GRB 220107A as a test case illustrating how future higher-sensitivity observations could discriminate between competing emission models for multi-episode bursts. Our results emphasize both the promise and current limitations of prompt-phase polarimetry.
Authors' comments: 16 pages, 10 figures, 6 tables, accepted for publication in Astronomy and Astrophysics
Huajie Chen, Tianqing Zhu, Hailin Yang, Yuchen Zhong, Yang Zhang, Hui Sun, Heng Xu, Zuobin Ying et al.
Watermarking has emerged as a key defense against the misuse of machine-generated images (MGIs). Yet the robustness of these protections remains underexplored. To reveal the limits of SOTA proactive image watermarking defenses, we propose HIDE&SEEK (HS), a suite of versatile and cost-effective attacks that reliably remove embedded watermarks while preserving high visual fidelity.
Hainan Xu, Vladimir Bataev, Travis M. Bartley, Jagadeesh Balam
We propose Chunk-wise Attention Transducer (CHAT), a novel extension to RNN-T models that processes audio in fixed-size chunks while employing cross-attention within each chunk. This hybrid approach maintains RNN-T's streaming capability while introducing controlled flexibility for local alignment modeling. CHAT significantly reduces the temporal dimension that RNN-T must handle, yielding substantial efficiency improvements: up to 46.2% reduction in peak training memory, up to 1.36X faster training, and up to 1.69X faster inference. Alongside these efficiency gains, CHAT achieves consistent accuracy improvements over RNN-T across multiple languages and tasks -- up to 6.3% relative WER reduction for speech recognition and up to 18.0% BLEU improvement for speech translation. The method proves particularly effective for speech translation, where RNN-T's strict monotonic alignment hurts performance. Our results demonstrate that the CHAT model offers a practical solution for deploying more capable streaming speech models without sacrificing real-time constraints.
Authors' comments: Accepted at ICASSP 2026
Haoran Wang, Guoxi Huang, Fan Zhang, David Bull, Nantheera Anantrasirichai
Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with photorealistic quality. However, 3DGS often requires a large number of primitives to achieve high fidelity, leading to redundant representations and high resource consumption, thereby limiting its scalability for complex or large-scale scenes. Consequently, effective pruning strategies and more expressive primitives that can reduce redundancy while preserving visual quality are crucial for practical deployment. We propose an efficient, integrated reconstruction-aware pruning strategy that adaptively determines pruning timing and refining intervals based on reconstruction quality, thus reducing model size while enhancing rendering quality. Moreover, we introduce a 3D Difference-of-Gaussians primitive that jointly models both positive and negative densities in a single primitive, improving the expressiveness of Gaussians under compact configurations. Our method significantly improves model compactness, achieving up to 90\% reduction in Gaussian count while delivering visual quality that is similar to, or in some cases better than, that produced by state-of-the-art methods. Code will be made publicly available.
Authors' comments: CVPR2026
Xiang Li, Nan Jiang, Yuheng Zhang
We investigate the theoretical aspects of offline reinforcement learning (RL) under general function approximation. While prior works (e.g., Xie et al., 2021) have established the theoretical foundations of learning a good policy from offline data via pessimism, existing algorithms that are computationally tractable (often in an oracle-efficient sense), such as PSPI, only apply to finite and small action spaces. Moreover, these algorithms rely on state-wise mirror descent and require actors to be implicitly induced from the critic functions, failing to accommodate standalone policy parameterization which is ubiquitous in practice. In this work, we address these limitations and extend the theoretical guarantees to parameterized policy classes over large or continuous action spaces. When extending mirror descent to parameterized policies, we identify contextual coupling as the core difficulty, and show how connecting mirror descent to natural policy gradient leads to novel analyses, guarantees, and algorithmic insights, including a surprising unification between offline RL and imitation learning.
Jiayang Meng, Tao Huang, Chen Hou, Guolong Zheng, Hong Chen
In Embedding-as-an-Interface (EaaI) settings, pre-trained models are queried for Intermediate Representations (IRs). The distributional properties of IRs can leak training-set membership signals, enabling Membership Inference Attacks (MIAs) whose strength varies across layers. Although Differentially Private Stochastic Gradient Descent (DP-SGD) mitigates such leakage, existing implementations employ per-example gradient clipping and a uniform, layer-agnostic noise multiplier, ignoring heterogeneous layer-wise MIA vulnerability. This paper introduces Layer-wise MIA-risk-aware DP-SGD (LM-DP-SGD), which adaptively allocates privacy protection across layers in proportion to their MIA risk. Specifically, LM-DP-SGD trains a shadow model on a public shadow dataset, extracts per-layer IRs from its train/test splits, and fits layer-specific MIA adversaries, using their attack error rates as MIA-risk estimates. Leveraging the cross-dataset transferability of MIAs, these estimates are then used to reweight each layer's contribution to the globally clipped gradient during private training, providing layer-appropriate protection under a fixed noise magnitude. We further establish theoretical guarantees on both privacy and convergence of LM-DP-SGD. Extensive experiments show that, under the same privacy budget, LM-DP-SGD reduces the peak IR-level MIA risk while preserving utility, yielding a superior privacy-utility trade-off.
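The idea of reweighting each layer's contribution to a globally clipped gradient can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how MIA-risk estimates become weights, so `layer_weights` is taken as given, and the function name and data layout (a dict of flat per-layer gradient lists for one example) are hypothetical. Noise addition and privacy accounting are omitted.

```python
import math

def reweight_and_clip(per_layer_grads, layer_weights, clip_norm):
    """Scale each layer's gradient by its weight, then clip the global L2 norm.

    per_layer_grads: dict mapping layer name -> list of floats (one example).
    layer_weights:   dict mapping layer name -> non-negative scale (assumed given).
    clip_norm:       maximum global L2 norm after clipping.
    """
    scaled = {
        name: [layer_weights[name] * g for g in grad]
        for name, grad in per_layer_grads.items()
    }
    global_norm = math.sqrt(sum(g * g for grad in scaled.values() for g in grad))
    # Standard DP-SGD clipping: shrink only when the norm exceeds the bound.
    factor = min(1.0, clip_norm / global_norm) if global_norm > 0 else 1.0
    return {name: [factor * g for g in grad] for name, grad in scaled.items()}

grads = {"layer1": [3.0, 0.0], "layer2": [0.0, 4.0]}
weights = {"layer1": 1.0, "layer2": 0.5}  # hypothetical weights
clipped = reweight_and_clip(grads, weights, clip_norm=1.0)
```

In this sketch a smaller weight shrinks a layer's share of the clipped gradient, so noise of a fixed magnitude added afterwards is relatively larger for that layer. Because the global norm is still bounded by `clip_norm`, the per-example sensitivity bound used by standard DP-SGD is unchanged.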
Haodong Chen, Shengyao Zhuang, Zheng Yao, Guido Zuccon, Teerapong Leelanupab
Zero-shot document re-ranking with Large Language Models (LLMs) has evolved from Pointwise methods to Listwise and Setwise approaches that optimize computational efficiency. Despite their success, these methods predominantly rely on generative scoring or output logits, which face bottlenecks in inference latency and result consistency. In-Context Re-ranking (ICR) has recently been proposed as an $O(1)$ alternative method. ICR extracts internal attention signals directly, avoiding the overhead of text generation. However, existing ICR methods simply aggregate signals across all layers; layer-wise contributions and their consistency across architectures have been left unexplored. Furthermore, no unified study has compared internal attention with traditional generative and likelihood-based mechanisms across diverse ranking frameworks under consistent conditions.
In this paper, we conduct an orthogonal evaluation of generation, likelihood, and internal attention mechanisms across multiple ranking frameworks. We further identify a universal "bell-curve" distribution of relevance signals across transformer layers, which motivates the proposed Selective-ICR strategy that reduces inference latency by 30%-50% without compromising effectiveness. Finally, evaluation on the reasoning-intensive BRIGHT benchmark shows that precisely capturing high-quality in-context attention signals fundamentally reduces the need for model scaling and reinforcement learning: a zero-shot 8B model matches the performance of 14B reinforcement-learned re-rankers, while even a 0.6B model outperforms state-of-the-art generation-based approaches. These findings redefine the efficiency-effectiveness frontier for LLM-based re-ranking and highlight the latent potential of internal signals for complex reasoning ranking tasks. Our code and results are publicly available at https://github.com/ielab/Selective-ICR.
Authors' comments: 10 pages, 5 figures, 1 table. Code available at https://github.com/ielab/Selective-ICR
Minqiu Sun, Xin Huang, Luanzheng Guo, Nathan R. Tallent, Kento Sato, Dong Dai
Checkpointing is essential for fault tolerance in training large language models (LLMs). However, existing methods, regardless of their I/O strategies, periodically store the entire model and optimizer states, incurring substantial storage overhead and resource contention. Recent studies reveal that updates across LLM layers are highly non-uniform. Across training steps, some layers may undergo more significant changes, while others remain relatively stable or even unchanged. This suggests that selectively checkpointing only layers with significant updates could reduce overhead without harming training. Implementing such selective strategies requires fine-grained control over both weights and optimizer states, which no current tool provides. To address this gap, we propose \texttt{LLMTailor}, a checkpoint-merging framework that filters and assembles layers from different checkpoints to form a composite checkpoint. Our evaluation indicates that LLMTailor can work with different selective checkpointing strategies and effectively reduce checkpoint size (e.g., 4.3 times smaller for Llama3.1-8B) and checkpoint time (e.g., 2.8 times faster for Qwen2.5-7B) while maintaining model quality.
Authors' comments: 9 pages, 3 figures, accepted at PDSW'25
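The checkpoint-merging idea in the abstract above, assembling a composite checkpoint from layers saved at different steps, can be illustrated with a toy sketch. The data layout and the function name `merge_checkpoints` are hypothetical and do not reflect LLMTailor's actual interface. Real checkpoints also carry optimizer state, which is reduced here to an opaque per-layer value.

```python
def merge_checkpoints(checkpoints):
    """Build a composite checkpoint from partial ones.

    checkpoints: list of {"step": int, "layers": {name: state}} dicts, where
    each partial checkpoint holds only the layers selected at that step.
    Returns (composite, source): the newest saved state of every layer, and
    the step each layer was taken from.
    """
    composite, source = {}, {}
    for ckpt in sorted(checkpoints, key=lambda c: c["step"]):
        for name, state in ckpt["layers"].items():
            composite[name] = state      # newer steps overwrite older ones
            source[name] = ckpt["step"]
    return composite, source

# A full checkpoint at step 100, then two selective (partial) checkpoints.
ckpts = [
    {"step": 100, "layers": {"embed": "e100", "block0": "b0_100", "head": "h100"}},
    {"step": 200, "layers": {"block0": "b0_200"}},
    {"step": 300, "layers": {"head": "h300"}},
]
composite, source = merge_checkpoints(ckpts)
```

Tracking the source step per layer makes it explicit which layers in the composite are stale relative to the latest training step, which matters when judging whether a selective strategy harms recovery.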
Sifei Li, Yang Li, Zizhou Wang, Yuxin Zhang, Fuzhang Wu, Oliver Deussen, Tong-Yee Lee, Weiming Dong
Cover songs constitute a vital aspect of musical culture, preserving the core melody of an original composition while reinterpreting it to infuse novel emotional depth and thematic emphasis. Although prior research has explored the reinterpretation of instrumental music through melody-conditioned text-to-music models, the task of cover song generation remains largely unaddressed. In this work, we formulate cover song generation as a conditional generation task that simultaneously generates new vocals and accompaniment conditioned on the original vocal melody and text prompts. To this end, we present SongEcho, which leverages Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), a framework that enables controllable generation by improving both the conditioning injection mechanism and the conditional representation. To enhance the conditioning injection mechanism, we extend Feature-wise Linear Modulation (FiLM) to Element-wise Linear Modulation (EiLM), which facilitates precise temporal alignment in melody control. For conditional representations, we propose Instance-Adaptive Condition Refinement (IACR), which refines conditioning features by interacting with the hidden states of the generative model, yielding instance-adaptive conditioning. Additionally, to address the scarcity of large-scale, open-source full-song datasets, we construct Suno70k, a high-quality AI song dataset enriched with comprehensive annotations. Experimental results across multiple datasets demonstrate that our approach generates superior cover songs compared to existing methods, while requiring fewer than 30% of the trainable parameters. The code, dataset, and demos are available at https://github.com/lsfhuihuiff/SongEcho_ICLR2026.
Authors' comments: Accepted at ICLR 2026. 21 pages (10 pages main text), 5 figures
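The step from FiLM to EiLM mentioned in the abstract above can be made concrete with a small sketch. FiLM applies one scale and one shift per feature channel. The element-wise variant shown here, with a separate scale and shift for every channel and time step, is an assumed reading of "element-wise" and may differ from the paper's exact definition. The function names and the nested-list `[channels][time]` layout are illustrative.

```python
def film(x, gamma, beta):
    """Feature-wise modulation: one (gamma, beta) pair per channel.

    x: [channels][time]; gamma, beta: [channels].
    """
    return [[gamma[c] * v + beta[c] for v in row] for c, row in enumerate(x)]

def eilm(x, gamma, beta):
    """Element-wise modulation (assumed form): one (gamma, beta) pair per
    channel AND time step, so conditioning can differ along the time axis.

    x, gamma, beta: [channels][time].
    """
    return [
        [gamma[c][t] * v + beta[c][t] for t, v in enumerate(row)]
        for c, row in enumerate(x)
    ]

x = [[1.0, 2.0], [3.0, 4.0]]
y_film = film(x, gamma=[2.0, 0.5], beta=[0.0, 1.0])
y_eilm = eilm(x, gamma=[[1.0, 0.0], [2.0, 1.0]], beta=[[0.0, 5.0], [0.0, 0.0]])
```

Because the per-channel parameters in `film` are shared across all time steps, they cannot encode where in time a melody note occurs, whereas time-indexed parameters can. That is the intuition behind the temporal-alignment claim.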
Hongxuan Wu, Yukun Zhang, Xueqing Zhou
When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation -- and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce \emph{PID Flow}, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent \emph{modal transduction} pattern: visual-unique information peaks early and decays with depth, language-unique information surges in late layers to account for roughly 82\% of the final prediction, and cross-modal synergy remains below 2\%. This trajectory is highly stable across model variants (layer-wise correlations $>$0.96) yet strongly task-dependent, with semantic redundancy governing the detailed information fingerprint. To establish causality, we perform targeted Image$\rightarrow$Question attention knockouts and show that disrupting the primary transduction pathway induces predictable increases in trapped visual-unique information, compensatory synergy, and total information cost -- effects that are strongest in vision-dependent tasks and weakest in high-redundancy tasks. Together, these results provide an information-theoretic, causal account of how vision becomes language in multimodal Transformers, and offer quantitative guidance for identifying architectural bottlenecks where modality-specific information is lost.
Mickaël Li, Nan Zeng, Liangyu Deng, Mingzhou Jiang, Chang Wu, Honghui He
The extracellular matrix (ECM) is a key structural foundation of the human body, forming a complex network of large proteins and carbohydrates that provides structural support to surrounding cells. Remodeling of the ECM's structural fibers offers insight into the development of diseases such as cancer, fibrosis, and carcinoma. While standard visualization of ECM tissue involves multiple lengthy histopathological staining protocols, Mueller matrix-based polarimetry provides label-free microstructural information and optical properties of tissue slices. This work aims to identify three types of fiber tissue commonly found in the ECM of gastrointestinal tissue specimens by analyzing their polarization properties. Because decomposition methods rely on restrictive hypotheses, and because no individual polarization-based parameter can determine the nature of a given biological tissue, this study employs the Uniform Manifold Approximation and Projection (UMAP) method to offer greater discriminative power and flexibility. Subsequently, polarization-based features are extracted and compared statistically between fiber regions to discern potential diagnostic differences. By providing colorized images, this work aims to demonstrate the feasibility of distinguishing different fibers with a polarization-based approach, offering insights for future clinical development while complementing existing staining methods for pathological tissue specimens.
Lamine Rihani
Artificial intelligence/machine learning (AI/ML) systems and emerging quantum computing software present unprecedented testing challenges: high-dimensional/continuous input spaces, probabilistic/non-deterministic output distributions, behavioral correctness defined exclusively over observable prediction behaviors and measurement outcomes, and critical quality dimensions (trustworthiness, fairness, calibration, robustness, error syndrome patterns) that manifest through complex multi-way interactions among semantically meaningful output properties rather than deterministic input-output mappings. This paper introduces reverse n-wise output testing, a mathematically principled paradigm inversion that constructs covering arrays directly over domain-specific output equivalence classes, including ML confidence calibration buckets, decision boundary regions, fairness partitions, embedding clusters, ranking stability bands, quantum measurement outcome distributions (0-dominant, 1-dominant, superposition collapse), and error syndrome patterns (bit-flip, phase-flip, correlated errors). It then solves the computationally challenging black-box inverse mapping problem via gradient-free metaheuristic optimization to synthesize input feature configurations or quantum circuit parameters capable of eliciting targeted behavioral signatures from opaque models. The framework delivers synergistic benefits across both domains: explicit customer-centric prediction/measurement coverage guarantees, substantial improvements in fault detection rates for ML calibration/boundary failures and quantum error syndromes, enhanced test suite efficiency, and structured MLOps/quantum validation pipelines with automated partition discovery from uncertainty analysis and coverage drift monitoring.
Abhinav Shukla, Nachiket Tapas
Deep neural networks (DNNs) have achieved remarkable success in object detection tasks, but their increasing complexity poses significant challenges for deployment on resource-constrained platforms. While model compression techniques such as pruning have emerged as essential tools, traditional magnitude-based pruning methods do not necessarily align with the true functional contribution of network components to task-specific performance. In this work, we present an explainability-inspired, layer-wise pruning framework tailored for efficient object detection. Our approach leverages a SHAP-inspired gradient--activation attribution to estimate layer importance, providing a data-driven proxy for functional contribution rather than relying solely on static weight magnitudes. We conduct comprehensive experiments across diverse object detection architectures, including ResNet-50, MobileNetV2, ShuffleNetV2, Faster R-CNN, RetinaNet, and YOLOv8, evaluating performance on the Microsoft COCO 2017 validation set. The results show that the proposed attribution-inspired pruning consistently identifies different layers as least important compared to L1-norm-based methods, leading to improved accuracy--efficiency trade-offs. Notably, for ShuffleNetV2, our method yields a 10\% empirical increase in inference speed, whereas L1-pruning degrades performance by 13.7\%. For RetinaNet, the proposed approach preserves the baseline mAP (0.151) with negligible impact on inference speed, while L1-pruning incurs a 1.3\% mAP drop for a 6.2\% speed increase. These findings highlight the importance of data-driven layer importance assessment and demonstrate that explainability-inspired compression offers a principled direction for deploying deep neural networks on edge and resource-constrained platforms while preserving both performance and interpretability.
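A gradient-activation attribution score for ranking layers, in the spirit of the abstract above, might look like the following sketch. The specific score used here (the mean absolute product of a layer's activations and the loss gradients with respect to them) is an assumption and not the paper's exact formula. The function names and the toy numbers are illustrative.

```python
def layer_importance(activations, gradients):
    """Mean |activation * gradient| over a layer's elements (assumed score)."""
    products = [abs(a * g) for a, g in zip(activations, gradients)]
    return sum(products) / len(products)

def least_important_layer(layers):
    """layers: dict mapping layer name -> (activations, gradients), both flat lists."""
    return min(layers, key=lambda name: layer_importance(*layers[name]))

# Toy values: conv2 has activations of similar size to conv1 but small gradients.
layers = {
    "conv1": ([1.0, 2.0], [0.5, 0.5]),
    "conv2": ([1.0, 1.0], [0.1, 0.1]),
}
prune_first = least_important_layer(layers)
```

Unlike an L1-norm criterion, which looks only at static weights, a score of this kind depends on the data passed through the network, which is why the two criteria can rank layers differently.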
Hamed Heidari-Gorji, Raquel Gil Rodriguez, Karl R. Gegenfurtner
We previously investigated color constancy in photorealistic virtual reality (VR) and developed a Deep Neural Network (DNN) that predicts reflectance from rendered images. Here, we combine both approaches to compare model and human performance with respect to established color constancy mechanisms: local surround, maximum flux, and spatial mean. Rather than evaluating the model against physical ground truth, model performance was assessed using the same achromatic object selection task employed in the human experiments. The model, a ResNet-based U-Net from our previous work, was pre-trained on rendered images to predict surface reflectance. We then applied transfer learning, fine-tuning only the network's decoder on images from the baseline VR condition. To parallel the human experiment, the model's output was used to perform the same achromatic object selection task across all conditions. Results show a strong correspondence between model and human behavior. Both achieved high constancy under baseline conditions and showed similar, condition-dependent performance declines when the local surround or spatial mean color cues were removed.
Yongkang Jin, Jianwen Luo, Jingjing Wang, Jianmin Yao, Yu Hong
Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision--Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.