Lee R. Martin, Andrew W. Blain, Tanio Díaz-Santos, Roberto J. Assef, Chao-Wei Tsai, Hyunsung D. Jun, Peter R. M. Eisenhardt, Jingwen Wu et al.
We present observations of mid-J (J=4-3 or J=5-4) carbon monoxide (CO) emission
lines and continuum emission from a sample of ten of the most luminous
(log(L/L_solar)~14) Hot Dust-Obscured Galaxies (Hot DOGs) discovered by the
Wide-field Infrared Survey Explorer (WISE) with redshifts up to 4.6. We uncover
broad spectral lines (FWHM~400 km/s) in these objects, suggesting a turbulent
molecular interstellar medium (ISM) may be ubiquitous in Hot DOGs. A halo of
molecular gas, extending out to a radius of 5 kpc, is observed in W2305-0039,
likely supplied by 940 km/s molecular outflows. W0831+0140 is plausibly the
host of a merger between at least two galaxies, consistent with observations
made using ionized gas. These CO(4-3) observations contrast with previous
CO(1-0) studies of the same sources: the CO(4-3) to CO(1-0) luminosity ratios
exceed 300 in each source, suggesting that the lowest excited states of CO are
underluminous. These findings show that the molecular gas in Hot DOGs is
consistently turbulent, plausibly a consequence of AGN feedback, triggered by
galactic mergers.
Authors' comments: 19 pages (16 main text & 3 in Appendix), 9 figures, plus 3 in
Appendix. MNRAS in press
Tianyuan Zhang, Lu Wang, Jiaqi Kang, Xinwei Zhang, Siyuan Liang, Yuwei Chen, Aishan Liu, Xianglong Liu
Recent advances in deep learning have markedly improved autonomous driving
(AD) models, particularly end-to-end systems that integrate perception,
prediction, and planning stages, achieving state-of-the-art performance.
However, these models remain vulnerable to adversarial attacks, where
human-imperceptible perturbations can disrupt decision-making processes. While
adversarial training is an effective method for enhancing model robustness
against such attacks, no prior studies have focused on its application to
end-to-end AD models. In this paper, we take the first step in adversarial
training for end-to-end AD models and present a novel Module-wise Adaptive
Adversarial Training (MA2T). However, extending conventional adversarial
training to this context is highly non-trivial, as different stages within the
model have distinct objectives and are strongly interconnected. To address
these challenges, MA2T first introduces Module-wise Noise Injection, which
injects noise at the input of each module and trains the model under the
guidance of the overall objective rather than each module's independent
loss. Additionally, we introduce Dynamic Weight Accumulation Adaptation, which
incorporates accumulated weight changes to adaptively learn and adjust the loss
weights of each module based on their contributions (accumulated reduction
rates) for better balance and robust training. To demonstrate the efficacy of
our defense, we conduct extensive experiments on the widely-used nuScenes
dataset across several end-to-end AD models under both white-box and black-box
attacks, where our method outperforms other baselines by large margins
(+5-10%). Moreover, we validate the robustness of our defense through
closed-loop evaluation in the CARLA simulation environment, showing improved
resilience even against natural corruption.
Authors' comments: 14 pages
Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li
Deploying large language models (LLMs) on edge devices presents significant challenges due to the substantial computational overhead and memory requirements. Activation sparsification can mitigate these resource challenges by reducing the number of activated neurons during inference. Existing methods typically employ thresholding-based sparsification based on the statistics of activation tensors. However, they do not model the impact of activation sparsification on performance, resulting in greater performance degradation than necessary. To address these limitations, this paper reformulates the activation sparsification problem to explicitly capture the relationship between activation sparsity and model performance. Then, this paper proposes CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification. First, channel-wise thresholding assigns a unique threshold to each activation channel in the feed-forward network (FFN) layers. Then, selective sparsification applies thresholding-based activation sparsification to specific layers within the attention modules. Finally, we detail the implementation of sparse kernels to accelerate LLM inference. Experimental results demonstrate that the proposed CHESS achieves lower performance degradation over eight downstream tasks while activating fewer parameters than existing methods, thus speeding up LLM inference by up to 1.27x.
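The channel-wise thresholding idea can be illustrated with a minimal sketch. It assumes each channel's threshold is simply a quantile of that channel's absolute activations on a small calibration set; CHESS derives its thresholds from its reformulated objective, which is not reproduced here. The names `channel_thresholds`, `sparsify`, and `target_sparsity` are hypothetical.

```python
def channel_thresholds(calibration, target_sparsity):
    """Pick one threshold per channel from calibration activations.

    calibration: list of token rows, each a list with one value per channel.
    Assumption (not from the paper): the threshold is the per-channel
    quantile of |activation| at `target_sparsity`.
    """
    num_channels = len(calibration[0])
    thresholds = []
    for c in range(num_channels):
        magnitudes = sorted(abs(row[c]) for row in calibration)
        cut = min(int(target_sparsity * len(magnitudes)), len(magnitudes) - 1)
        thresholds.append(magnitudes[cut])
    return thresholds


def sparsify(activation, thresholds):
    """Zero every channel whose magnitude falls below its own threshold."""
    return [a if abs(a) >= t else 0.0 for a, t in zip(activation, thresholds)]


# Two channels with very different scales get very different thresholds.
calib = [[0.1, 5.0], [0.2, -6.0], [0.3, 7.0], [0.4, -8.0]]
thr = channel_thresholds(calib, target_sparsity=0.5)
sparse = sparsify([0.25, -7.5], thr)
```

A single tensor-wide threshold would treat the small-scale and large-scale channels identically; the per-channel version prunes each channel relative to its own range.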
Xi Xie, Yuebo Luo, Hongwu Peng, Caiwen Ding
Top-k algorithms are essential in various applications, from high-performance computing and information retrieval to big data and neural network model training. This paper introduces RTop-K, a highly efficient parallel row-wise top-k selection algorithm designed for GPUs. RTop-K employs a Binary Search-based approach to optimize resource allocation and provides a scalable solution that significantly accelerates top-k operations. We perform a theoretical analysis of the effects of early stopping in our algorithm, demonstrating that it maintains the accuracy of neural network models while enhancing performance. Comprehensive tests show that our GPU implementation of RTop-K outperforms other row-wise top-k GPU implementations, with minimal impact on testing accuracy when early stopping is applied. Notably, RTop-K achieves speed increases ranging from 4.245$\times$ to 9.506$\times$ with early stopping, and 3.936$\times$ without early stopping, compared to state-of-the-art implementations. The proposed methods offer significant improvements in the training and inference of Graph Neural Networks (GNNs), addressing critical challenges in latency and throughput on GPU platforms.
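A minimal CPU sketch of threshold search by bisection may help picture the approach. It assumes the search bisects the value range of each row until exactly k elements lie at or above the threshold, and that early stopping simply caps the number of iterations; the actual RTop-K GPU kernel and its stopping rule are not reproduced here. The names `row_topk_binary_search`, `rtopk`, and `max_iter` are hypothetical.

```python
def row_topk_binary_search(row, k, max_iter=32):
    """Select k elements of `row` using a threshold found by bisection.

    With a small `max_iter` (early stopping) the result is approximate:
    every returned element is >= the final threshold, but the set may
    differ from the exact top-k.
    """
    if not 0 < k <= len(row):
        raise ValueError("k must be between 1 and len(row)")
    lo, hi = min(row), max(row)  # invariant: at least k elements are >= lo
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        count = sum(1 for x in row if x >= mid)
        if count >= k:
            lo = mid
            if count == k:  # threshold separates exactly k elements
                break
        else:
            hi = mid
    candidates = [x for x in row if x >= lo]
    return candidates[:k]


def rtopk(matrix, k, max_iter=32):
    """Apply the row-wise selection independently to every row."""
    return [row_topk_binary_search(row, k, max_iter) for row in matrix]


selected = rtopk([[0.1, 0.9, 0.4], [3.0, 2.0, 1.0]], k=2)
```

Each row is independent, which is what makes the row-wise formulation straightforward to parallelize; the selected elements are returned in their original order rather than sorted.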
Haofeng Liu, Emad Alsusa, Arafat Al-Dweik
This paper investigates the bit error rate (BER) and outage probability performance of integrated sensing and communication (ISaC) in uplink non-orthogonal multiple access (NOMA) based Internet of Things (IoT) systems. Specifically, we consider an ISaC system where the radar signal is designed to be orthogonal to the communication signal over two symbol periods so that its interference on the communication signal is completely eliminated when detecting the data in pairs of consecutive symbols. This is akin to multi-symbol rate NOMA systems, except that here the radar bears no data, so its waveform is manipulated to be orthogonal to the transmitted communication signal. To eliminate potential decision ambiguity during the pair-wise data detection, a constant phase-offset between adjacent communication symbols is applied at the transmitter. The performance of such a system is analyzed by deriving analytical expressions for the exact BER of zero-forcing (ZF) based receivers. In addition, closed-form expressions for the upper BER bound and the outage probability for both ZF and the joint maximum likelihood (JML) receivers are presented. The results show that the derived expressions perfectly match the simulation results. The obtained expressions provide insight into the performance of this novel ISaC system, including demonstrating the impact of various parameters and showing how the ZF receiver provides a useful trade-off between performance and complexity relative to the JML receiver.
Zizheng Huang, Haoxing Chen, Jiaqi Li, Jun Lan, Huijia Zhu, Weiqiang Wang, Limin Wang
Recent Vision Mamba models not only have much lower complexity for processing higher-resolution images and longer videos but also achieve performance competitive with Vision Transformers (ViTs). However, they are prone to overfitting and have therefore only been presented up to base size (about 80M). It is still unclear how vanilla Vision Mamba (Vim) can be efficiently scaled up to larger sizes, which is essential for further exploitation. In this paper, we propose a stochastic layer-wise shuffle regularization, which enables successful scaling of non-hierarchical Vision Mamba to a large size (about 300M) in a supervised setting. Specifically, our base and large-scale ShuffleMamba models can outperform the supervised ViTs of similar size by 0.8% and 1.0% classification accuracy on ImageNet1k, respectively, without auxiliary data. When evaluated on the ADE20K semantic segmentation and COCO detection tasks, our ShuffleMamba models also show significant improvements. Without bells and whistles, the stochastic layer-wise shuffle has the following highlights: (1) Plug and play: it does not change model architectures and is omitted at inference. (2) Simple but effective: it mitigates overfitting in Vim training and only introduces random token permutation operations. (3) Intuitive: the token sequences in deeper layers are more likely to be shuffled as they are expected to be more semantic and less sensitive to patch positions. Code and models will be available at https://github.com/huangzizheng01/ShuffleMamba.
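The three highlights can be pictured with a minimal sketch. Two details are assumptions rather than statements from the abstract: the shuffle probability grows linearly with depth up to `max_prob`, and the original token order is restored after the layer. The names `shuffle_probability`, `layer_with_shuffle`, and `layer_fn` are hypothetical; the released code is the reference for the actual schedule.

```python
import random


def shuffle_probability(layer_idx, num_layers, max_prob=0.5):
    """Assumed linear schedule: deeper layers are shuffled more often."""
    return max_prob * (layer_idx + 1) / num_layers


def layer_with_shuffle(tokens, layer_fn, layer_idx, num_layers, rng,
                       training=True, max_prob=0.5):
    """Run `layer_fn` on a randomly permuted token sequence during training.

    At inference (training=False) the shuffle is skipped entirely, so the
    model architecture and its inference path are unchanged.
    """
    prob = shuffle_probability(layer_idx, num_layers, max_prob)
    if not training or rng.random() >= prob:
        return layer_fn(tokens)
    perm = list(range(len(tokens)))
    rng.shuffle(perm)
    out = layer_fn([tokens[i] for i in perm])
    # Assumption: undo the permutation so later layers see the usual order.
    restored = [None] * len(tokens)
    for new_pos, old_pos in enumerate(perm):
        restored[old_pos] = out[new_pos]
    return restored


rng = random.Random(0)
double = lambda seq: [2 * t for t in seq]  # stand-in for a real layer
out = layer_with_shuffle([1, 2, 3, 4], double, layer_idx=3, num_layers=4,
                         rng=rng, training=True, max_prob=1.0)
```

The only added operations are a permutation and its inverse, which is why the regularizer costs little and needs no architectural change.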
Sergey Karpov, Oleg Malkov, Alexandra Avdeeva
Thirty years after the discovery of brown dwarfs, the search for these objects
continues, particularly in the vicinity of the Sun. Objects near the Sun are
characterized by large proper motions, making them appear as fast-moving objects.
While the Gaia DR3 catalogue is a comprehensive source of proper motions, it
lacks the depth needed for discovering fainter objects. Modern multi-epoch
surveys, with their greater depth, offer a new opportunity for a systematic
search for ultra-cool dwarfs. The study aims to systematically search for high
proper motion objects using the newly released catalogue of epochal WISE data
in order to identify new brown dwarf candidates in the solar neighborhood and
to estimate their spectral types, distances and spatial velocities. We used
the recently released unTimely catalogue of epochal detections in unWISE coadds
to search for objects with high proper motions using a simple motion detection
algorithm. This method was used to identify objects with proper motions
exceeding approximately 0.6 arcseconds per year. The identified objects were
then cross-referenced with data from other large-scale sky surveys to further
analyze their characteristics. The search yielded 3245 moving objects with
significant proper motions, 32 of which had not been previously published.
Among these, at least 15 were identified as reliable new brown dwarf
candidates, with estimated distances closer than 50 parsecs and spectral types
later than T0.
Authors' comments: Table 1 is available online at https://zenodo.org/records/13362690.
Submitted to A&A
Langrui Zhou, Guang Li
The current mainstream multi-modal medical image-to-image translation methods
face a contradiction. Supervised methods with outstanding performance rely on
pixel-wise aligned training data to constrain the model optimization. However,
obtaining pixel-wise aligned multi-modal medical image datasets is challenging.
Unsupervised methods can be trained without paired data, but their reliability
cannot be guaranteed. At present, there is no ideal multi-modal medical
image-to-image translation method that can generate reliable translation
results without the need for pixel-wise aligned data. This work aims to develop
a novel medical image-to-image translation model that is independent of
pixel-wise aligned data (MITIA), enabling reliable multi-modal medical
image-to-image translation under the condition of misaligned training data. The
proposed MITIA model utilizes a prior extraction network composed of a
multi-modal medical image registration module and a multi-modal misalignment
error detection module to extract pixel-level prior information from training
data with misalignment errors to the largest extent. The extracted prior
information is then used to construct a regularization term to constrain the
optimization of the unsupervised cycle-consistent GAN model, restricting its
solution space and thereby improving the performance and reliability of the
generator. We trained the MITIA model using six datasets containing different
misalignment errors and two well-aligned datasets. Subsequently, we compared
the proposed method with six other state-of-the-art image-to-image translation
methods. The results of both quantitative analysis and qualitative visual
inspection indicate that MITIA achieves superior performance compared to the
competing state-of-the-art methods, both on misaligned data and aligned data.
Authors' comments: This paper has been accepted as a research article by Medical Physics
Zhikai Li, Xuewen Liu, Dongrong Joe Fu, Jianquan Li, Qingyi Gu, Kurt Keutzer, Zhen Dong
The rapid advancement of visual generative models necessitates efficient and
reliable evaluation methods. Arena platforms, which gather user votes on model
comparisons, can rank models by human preference. However, traditional Arena
methods, while established, require an excessive number of comparisons for
ranking to converge and are vulnerable to preference noise in voting,
suggesting the need for better approaches tailored to contemporary evaluation
challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable
platform based on a key insight: images and videos possess higher perceptual
intuitiveness than texts, enabling rapid evaluation of multiple samples
simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing
K models to engage in free-for-all competitions, which yield much richer
information than pairwise comparisons. To enhance the robustness of the system,
we leverage probabilistic modeling and Bayesian updating techniques. We propose
an exploration-exploitation-based matchmaking strategy to facilitate more
informative comparisons. In our experiments, K-Sort Arena exhibits 16.3x faster
convergence compared to the widely used Elo algorithm. To further validate the
superiority and obtain a comprehensive leaderboard, we collect human feedback
via crowdsourced evaluations of numerous cutting-edge text-to-image and
text-to-video models. Thanks to its high efficiency, K-Sort Arena can
continuously incorporate emerging models and update the leaderboard with
minimal votes. Our project has undergone several months of internal testing and
is now available at https://huggingface.co/spaces/ksort/K-Sort-Arena
Authors' comments: CVPR 2025. Project page:
https://huggingface.co/spaces/ksort/K-Sort-Arena
Antón de la Fuente, Dan Jurafsky
This study asks how self-supervised speech models represent suprasegmental
categories like Mandarin lexical tone, English lexical stress, and English
phrasal accents. Through a series of probing tasks, we make layer-wise
comparisons of English and Mandarin 12-layer monolingual models. Our findings
suggest that 1) English and Mandarin wav2vec 2.0 models learn contextual
representations of abstract suprasegmental categories which are strongest in
the middle third of the network. 2) Models are better at representing features
that exist in the language of their training data, and this difference is
driven by enriched context in transformer blocks, not local acoustic
representation. 3) Fine-tuned wav2vec 2.0 improves performance in later layers
compared to pre-trained models mainly for lexically contrastive features like
tone and stress. 4) HuBERT and WavLM learn similar representations to wav2vec
2.0, differing mainly in later layer performance. Our results extend previous
understanding of how models represent suprasegmentals and offer new insights
into the language-specificity and contextual nature of these representations.
Authors' comments: 4 pages, 3 figures, to be published in Interspeech 2024 proceedings
Mirko Nardi, Lorenzo Valerio, Andrea Passarella
Federated Learning (FL) is a pivotal approach in decentralized machine learning, especially when data privacy is crucial and direct data sharing is impractical. While FL is typically associated with supervised learning, its potential in unsupervised scenarios is underexplored. This paper introduces a novel unsupervised federated learning methodology designed to identify the complete set of categories (global K) across multiple clients within label-free, non-uniform data distributions, a process known as Federated Clustering. Our approach, Federated Cluster-Wise Refinement (FedCRef), involves clients that collaboratively train models on clusters with similar data distributions. Initially, clients with diverse local data distributions (local K) train models on their clusters to generate compressed data representations. These local models are then shared across the network, enabling clients to compare them through reconstruction error analysis, leading to the formation of federated groups. In these groups, clients collaboratively train a shared model representing each data distribution, while continuously refining their local clusters to enhance data association accuracy. This iterative process allows our system to identify all potential data distributions across the network and develop robust representation models for each. To validate our approach, we compare it with traditional centralized methods, establishing a performance baseline and showcasing the advantages of our distributed solution. We also conduct experiments on the EMNIST and KMNIST datasets, demonstrating FedCRef's ability to refine and align cluster models with actual data distributions, significantly improving data representation precision in unsupervised federated settings.
Zhanzhong Pang, Fadime Sener, Shrinivas Ramasubramanian, Angela Yao
Procedural activity videos often exhibit a long-tailed action distribution
due to varying action frequencies and durations. However, state-of-the-art
temporal action segmentation methods overlook the long tail and fail to
recognize tail actions. Existing long-tail methods make class-independent
assumptions and struggle to identify tail classes when applied to temporal
segmentation frameworks. This work proposes a novel group-wise temporal logit
adjustment (G-TLA) framework that combines a group-wise softmax formulation
with logit adjustment that leverages activity information and action ordering.
The proposed framework significantly improves the segmentation of tail actions
without any performance loss on head actions.
Authors' comments: Accepted by ECCV 2024
Jingcai Guo, Zhijie Rao, Zhi Chen, Song Guo, Jingren Zhou, Dacheng Tao
Zero-shot image recognition (ZSIR) aims to recognize and reason in unseen
domains by learning generalized knowledge from limited data in the seen domain.
The gist of ZSIR is constructing a well-aligned mapping between the input
visual space and the target semantic space, which is a bottom-up paradigm
inspired by the process by which humans observe the world. In recent years,
ZSIR has witnessed significant progress on a broad spectrum, from theory to
algorithm design, as well as widespread applications. However, to the best of
our knowledge, there remains a lack of a systematic review of ZSIR from an
element-wise perspective, i.e., learning fine-grained elements of data and
their inferential associations. To fill the gap, this paper thoroughly
investigates recent advances in element-wise ZSIR and provides a sound basis
for its future development. Concretely, we first integrate three basic ZSIR
tasks, i.e., object recognition, compositional recognition, and foundation
model-based open-world recognition, into a unified element-wise paradigm and
provide a detailed taxonomy and analysis of the main approaches. Next, we
summarize the benchmarks, covering technical implementations, standardized
datasets, and some more details as a library. Last, we sketch out related
applications, discuss vital challenges, and suggest potential future
directions.
Authors' comments: 20 pages, 6 figures, and 4 tables
Pengxiang Zhao, Hanyu Hu, Ping Li, Yi Zheng, Zhefeng Wang, Xiaoming Yuan
Pruning is a critical strategy for compressing trained large language models (LLMs), aiming at substantial memory conservation and computational acceleration without compromising performance. However, existing pruning methods often necessitate inefficient retraining for billion-scale LLMs or rely on heuristic methods such as the optimal brain surgeon framework, which degrade performance. In this paper, we introduce FISTAPruner, the first post-training pruner based on convex optimization models and algorithms. Specifically, we propose a convex optimization model incorporating $\ell_1$ norm to induce sparsity and utilize the FISTA solver for optimization. FISTAPruner incorporates an intra-layer cumulative error correction mechanism and supports parallel pruning. We comprehensively evaluate FISTAPruner on models such as OPT, LLaMA, LLaMA-2, and LLaMA-3 with 125M to 70B parameters under unstructured and 2:4 semi-structured sparsity, demonstrating superior performance over existing state-of-the-art methods across various language benchmarks.
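For readers unfamiliar with the solver, the following is a minimal sketch of the standard FISTA iteration applied to a generic $\ell_1$-regularized least-squares problem, min_w 0.5*||Xw - y||^2 + lam*||w||_1. It is not FISTAPruner's layer-wise model and omits its error-correction mechanism; the function names, the Frobenius-norm step-size bound, and the toy data are assumptions made for illustration.

```python
import math


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero."""
    return [math.copysign(max(abs(x) - tau, 0.0), x) for x in v]


def fista_lasso(X, y, lam, num_iters=200):
    """Standard FISTA for 0.5 * ||X w - y||^2 + lam * ||w||_1."""
    n, d = len(X), len(X[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of
    # X^T X, so 1 / lipschitz is a safe (if conservative) step size.
    lipschitz = sum(x * x for row in X for x in row)
    w = [0.0] * d
    z = list(w)  # extrapolated point used for the gradient step
    t = 1.0
    for _ in range(num_iters):
        residual = [sum(X[i][j] * z[j] for j in range(d)) - y[i]
                    for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n))
                for j in range(d)]
        w_next = soft_threshold(
            [z[j] - grad[j] / lipschitz for j in range(d)], lam / lipschitz)
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        momentum = (t - 1.0) / t_next
        z = [w_next[j] + momentum * (w_next[j] - w[j]) for j in range(d)]
        w, t = w_next, t_next
    return w


# Toy problem with identity X: the minimizer is soft_threshold(y, lam).
w = fista_lasso([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0)
```

On this toy problem the small coefficient is driven exactly to zero while the large one is only shrunk, which is the sparsity-inducing behaviour the $\ell_1$ term is used for.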
Wenhao Li, Jie Zhou, Chuan Luo, Chao Tang, Kun Zhang, Shixiong Zhao
In the realm of modern mobile E-commerce, providing users with nearby
commercial service recommendations through location-based online services has
become increasingly vital. While machine learning approaches have shown promise
in multi-scene recommendation, existing methodologies often struggle to address
cold-start problems in unprecedented scenes: the increasing diversity of
commercial choices, along with the short online lifespan of scenes, complicates
effective recommendation in online and dynamic scenes. In
this work, we propose Scene-wise Adaptive Network (SwAN), a novel approach that
emphasizes high-performance cold-start online recommendations for new scenes.
Our approach introduces several crucial capabilities, including scene
similarity learning, user-specific scene transition cognition, scene-specific
information construction for the new scene, and enhancement of the diverged
logical information between scenes. We demonstrate SwAN's potential to optimize
dynamic multi-scene recommendation problems by effectively handling cold-start
recommendations online for any newly arrived scenes. More encouragingly, SwAN has been
successfully deployed in Meituan's online catering recommendation service,
which serves millions of customers per day, and SwAN has achieved a 5.64% CTR
index improvement relative to the baselines and a 5.19% increase in daily order
volume proportion.
Authors' comments: 10 pages, 6 figures, accepted by RecSys 2024
Zijian Wang, Bin Wang, Haifeng Jing, Huayu Li, Hongbo Dou
In recent years, multi-hop reasoning has been widely studied for knowledge graph
(KG) reasoning due to its efficacy and interpretability. However, previous
multi-hop reasoning approaches are subject to two primary shortcomings. First,
agents struggle to learn effective and robust policies at the early phase due
to sparse rewards. Second, these approaches often falter on specific datasets
like sparse knowledge graphs, where agents are required to traverse lengthy
reasoning paths. To address these problems, we propose a multi-hop reasoning
model with dual agents based on hierarchical reinforcement learning (HRL),
which is named FULORA. FULORA tackles the above reasoning challenges by
eFficient GUidance-ExpLORAtion between dual agents. The high-level agent walks
on the simplified knowledge graph to provide stage-wise hints for the low-level
agent walking on the original knowledge graph. In this framework, the low-level
agent optimizes a value function that balances two objectives: (1) maximizing
return, and (2) integrating efficient guidance from the high-level agent.
Experiments conducted on three real-world knowledge graph datasets demonstrate
that FULORA outperforms RL-based baselines, especially in the case of
long-distance reasoning.
Authors' comments: Accepted by AAAI-25
Chenyan Liu, Yufan Cai, Yun Lin, Yuhuan Huang, Yunrui Pei, Bo Jiang, Ping Yang, Jin Song Dong et al.
Recent years have seen the development of LLM-based code generation. Compared
to generating code in a software project, incremental code edits are
empirically observed to be more frequent. The emerging code editing approaches
usually formulate the problem as generating an edit based on known relevant
prior edits and context. However, practical code edits can be more complicated.
First, an editing session can include multiple (ir)relevant edits to the code
under edit. Second, the inference of the subsequent edits is non-trivial as the
scope of its ripple effect can be the whole project. In this work, we propose
CoEdPilot, an LLM-driven solution to recommend code edits by discriminating the
relevant edits, exploring their interactive natures, and estimating their ripple
effects in the project. Specifically, CoEdPilot orchestrates multiple neural
transformers to identify what and how to edit in the project regarding both
edit location and edit content. When a user accomplishes an edit with an
optional editing description, a Subsequent Edit Analysis first reports the most
relevant files in the project with what types of edits (e.g., keep, insert, and
replace) can happen for each line of their code. Next, an Edit-content
Generator generates concrete edit options for the lines of code, based on the
relevant prior changes reported by an Edit-dependency Analyzer. Lastly, both
the Subsequent Edit Analysis and the Edit-content Generator capture relevant
prior edits as feedback to readjust their recommendations. We train our models
by collecting over 180K commits from 471 open-source projects in 5 programming
languages. Our extensive experiments show that CoEdPilot can well predict the
edits (i.e., predicting edit location with an accuracy of 70.8%-85.3%, and the
edit content with an exact match rate of 41.8% and BLEU4 score of 60.7)...
Authors' comments: 13 pages, 7 figures
Seungeun Oh, Sihun Baek, Jihong Park, Hyelin Nam, Praneeth Vepakomma, Ramesh Raskar, Mehdi Bennis, Seong-Lyun Kim
In computer vision, the vision transformer (ViT) has increasingly superseded
the convolutional neural network (CNN) for improved accuracy and robustness.
However, ViT's large model sizes and high sample complexity make it difficult
to train on resource-constrained edge devices. Split learning (SL) emerges as a
viable solution, leveraging server-side resources to train ViTs while utilizing
private data from distributed devices. However, SL requires additional
information exchange for weight updates between the device and the server,
which can be exposed to various attacks on private training data. To mitigate
the risk of data breaches in classification tasks, inspired by the CutMix
regularization, we propose a novel privacy-preserving SL framework that injects
Gaussian noise into smashed data and mixes randomly chosen patches of smashed
data across clients, coined DP-CutMixSL. Our analysis demonstrates that
DP-CutMixSL is a differentially private (DP) mechanism that strengthens privacy
protection against membership inference attacks during forward propagation.
Through simulations, we show that DP-CutMixSL improves privacy protection
against membership inference attacks, reconstruction attacks, and label
inference attacks, while also improving accuracy compared to DP-SL and
DP-MixSL.
Authors' comments: 23 pages, 11 figures, 8 tables, to be published in Transactions on
Machine Learning Research (TMLR)
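The two ingredients named in the abstract, Gaussian noise on smashed data and patch mixing across clients, can be sketched for two clients as follows. This is an illustration under assumptions: smashed data is represented as a list of patch vectors, the fraction of patches taken from the second client is a fixed `mix_ratio`, and no calibration of `sigma` to a privacy budget is shown. The names `add_gaussian_noise` and `cutmix_smashed` are hypothetical.

```python
import random


def add_gaussian_noise(patches, sigma, rng):
    """Perturb every element of every patch with N(0, sigma^2) noise."""
    return [[v + rng.gauss(0.0, sigma) for v in patch] for patch in patches]


def cutmix_smashed(client_a, client_b, mix_ratio, sigma, rng):
    """Mix noisy smashed data from two clients patch by patch.

    Returns the mixed patch sequence and the sorted indices taken from
    client_b. Each position keeps its index, so the mix is a CutMix-style
    swap of whole patches rather than an element-wise blend.
    """
    if len(client_a) != len(client_b):
        raise ValueError("both clients must provide the same number of patches")
    n = len(client_a)
    noisy_a = add_gaussian_noise(client_a, sigma, rng)
    noisy_b = add_gaussian_noise(client_b, sigma, rng)
    from_b = set(rng.sample(range(n), round(mix_ratio * n)))
    mixed = [noisy_b[i] if i in from_b else noisy_a[i] for i in range(n)]
    return mixed, sorted(from_b)


rng = random.Random(1)
a = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
b = [[-1.0, -1.0], [-2.0, -2.0], [-3.0, -3.0], [-4.0, -4.0]]
mixed, from_b = cutmix_smashed(a, b, mix_ratio=0.5, sigma=0.0, rng=rng)
```

Setting `sigma=0.0` isolates the mixing step for inspection; a non-zero `sigma` is what provides the noise component that a differential privacy analysis would rely on.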
Lukas Kratochvila, Gijs de Jong, Monique Arkesteijn, Simon Bilik, Tomas Zemcik, Karel Horak, Jan S. Rellermeyer
Digital twins have major potential to form a significant part of urban management in emergency planning, as they allow more efficient design of escape routes, better orientation in exceptional situations, and faster rescue intervention. Nevertheless, creating the twins remains a largely manual effort, due to a lack of 3D representations, which are available only in limited amounts for some new buildings. Thus, in this paper we aim to synthesize 3D information from commonly available 2D architectural floor plans. We propose two novel pixel-wise segmentation methods based on the MDA-Unet and MACU-Net architectures with improved skip connections, an attention mechanism, and a training objective together with a reconstruction part of the pipeline, which vectorizes the segmented plans to create a 3D model. The proposed methods are compared with two other state-of-the-art techniques on several benchmark datasets. On the commonly used CubiCasa benchmark dataset, our methods achieve a mean F1 score of 0.86 over five examined classes, outperforming the other pixel-wise approaches tested. We have also made our code publicly available to support research in the field.
Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix
This paper introduces a novel approach called sentence-wise speech
summarization (Sen-SSum), which generates text summaries from a spoken document
in a sentence-by-sentence manner. Sen-SSum combines the real-time processing of
automatic speech recognition (ASR) with the conciseness of speech
summarization. To explore this approach, we present two datasets for Sen-SSum:
Mega-SSum and CSJ-SSum. Using these datasets, our study evaluates two types of
Transformer-based models: 1) cascade models that combine ASR and strong text
summarization models, and 2) end-to-end (E2E) models that directly convert
speech into a text summary. While E2E models are appealing for building
compute-efficient systems, they perform worse than cascade models. Therefore, we
propose knowledge distillation for E2E models using pseudo-summaries generated
by the cascade models. Our experiments show that this proposed knowledge
distillation effectively improves the performance of the E2E model on both
datasets.
Authors' comments: Accepted to Interspeech 2024. Dataset:
https://huggingface.co/datasets/komats/mega-ssum