Baolei Zhang, Haoran Xin, Yuxi Chen, Zhuqing Liu, Biao Yi, Tong Li, Lihai Nie, Zheli Liu et al.
Retrieval-Augmented Generation (RAG) integrates external knowledge into large
language models to improve response quality. However, recent work has shown
that RAG systems are highly vulnerable to poisoning attacks, where malicious
texts are inserted into the knowledge database to influence model outputs.
While several defenses have been proposed, they are often circumvented by more
adaptive or sophisticated attacks.
This paper presents RAGOrigin, a black-box responsibility attribution
framework designed to identify which texts in the knowledge database are
responsible for misleading or incorrect generations. Our method constructs a
focused attribution scope tailored to each misgeneration event and assigns a
responsibility score to each candidate text by evaluating its retrieval
ranking, semantic relevance, and influence on the generated response. The
system then isolates poisoned texts using an unsupervised clustering method. We
evaluate RAGOrigin across seven datasets and fifteen poisoning attacks,
including newly developed adaptive poisoning strategies and multi-attacker
scenarios. Our approach outperforms existing baselines in identifying poisoned
content and remains robust under dynamic and noisy conditions. These results
suggest that RAGOrigin provides a practical and effective solution for tracing
the origins of corrupted knowledge in RAG systems.
Authors' comments: To appear in the IEEE Symposium on Security and Privacy, 2026
Hao Yin, Xin Man, Feiyu Chen, Jie Shao, Heng Tao Shen
Text-to-Image Person Retrieval (TIPR) is a cross-modal matching task that aims to retrieve the most relevant person images based on a given text query. The key challenge in TIPR lies in achieving effective alignment between textual and visual modalities within a common latent space. To address this challenge, prior approaches incorporate attention mechanisms for implicit cross-modal local alignment. However, they lack the ability to verify whether all local features are correctly aligned. Moreover, existing methods primarily focus on hard negative samples during model updates, with the goal of refining distinctions between positive and negative pairs, often neglecting incorrectly matched positive pairs. To alleviate these issues, we propose FMFA, a cross-modal Full-Mode Fine-grained Alignment framework, which enhances global matching through explicit fine-grained alignment and existing implicit relational reasoning -- hence the term ``full-mode" -- without requiring additional supervision. Specifically, we design an Adaptive Similarity Distribution Matching (A-SDM) module to rectify unmatched positive sample pairs. A-SDM adaptively pulls the unmatched positive pairs closer in the joint embedding space, thereby achieving more precise global alignment. Additionally, we introduce an Explicit Fine-grained Alignment (EFA) module, which makes up for the lack of verification capability of implicit relational reasoning. EFA strengthens explicit cross-modal fine-grained interactions by sparsifying the similarity matrix and employs a hard coding method for local alignment. Our proposed method is evaluated on three public datasets, achieving state-of-the-art performance among all global matching methods. Our code is available at https://github.com/yinhao1102/FMFA.
Lin Zhu, Lingwei Kong, Xin Ning, Xiaoyang Qu, Jianzong Wang
Private Information Retrieval (PIR) is a fundamental cryptographic primitive
that enables users to retrieve data from a database without revealing which
item is being accessed, thereby preserving query privacy. However, PIR
protocols also face the challenge of result verifiability, as users expect the
reconstructed data to be trustworthy and authentic. In this work, we propose
two effective constructions of publicly verifiable PIR (PVPIR) in the
multi-server setting, which achieve query privacy, correctness, and
verifiability simultaneously. We further present three concrete instantiations
based on these constructions. For the point query, our protocol introduces
minimal computational overhead and achieves strong verifiability guarantees
with significantly lower communication costs compared to existing Merkle
tree-based approaches. For the predicate query, the communication complexity of
our scheme remains stable as the database size increases, demonstrating strong
scalability and suitability for large-scale private query applications.
Authors' comments: Accepted by the 21st International Conference on Information Security
and Cryptology (Inscrypt2025)
Amanda Chan, James Jiayu Liu, He Kai, Onno P. Kampman
Access to reliable mental health information is vital for early help-seeking,
yet expanding knowledge bases is resource-intensive and often misaligned with
user needs. This results in poor performance of retrieval systems when
presented concerns are not covered or expressed in informal or contextualized
language. We present an AI-based gap-informed framework for corpus augmentation
that authentically identifies underrepresented topics (gaps) by overlaying
naturalistic user data such as forum posts in order to prioritize expansions
based on coverage and usefulness. In a case study, we compare Directed
(gap-informed augmentations) with Non-Directed augmentation (random additions),
evaluating the relevance and usefulness of retrieved information across four
retrieval-augmented generation (RAG) pipelines. Directed augmentation achieved
near-optimal performance with modest expansions--requiring only a 42% increase
for Query Transformation, 74% for Reranking and Hierarchical, and 318% for
Baseline--to reach ~95% of the performance of an exhaustive reference corpus.
In contrast, Non-Directed augmentation required substantially larger and thus
practically infeasible expansions to achieve comparable performance (232%,
318%, 403%, and 763%, respectively). These results show that strategically
targeted corpus growth can reduce content creation demands while sustaining
high retrieval and provision quality, offering a scalable approach for building
trusted health information repositories and supporting generative AI
applications in high-stakes domains.
Authors' comments: 25 pages, 3 figures, submitted to NeurIPS 2025 GenAI4Health
Yongye Su, Zeya Zhang, Jane Kou, Cheng Ju, Shubhojeet Sarkar, Yamin Wang, Ji Liu, Shengbo Guo
Beyond general web-scale search, social network search uniquely enables users
to retrieve information and discover potential connections within their social
context. We introduce a framework of modernized Facebook Group Scoped Search by
blending traditional keyword-based retrieval with embedding-based retrieval
(EBR) to improve the search relevance and diversity of search results. Our
system integrates semantic retrieval into the existing keyword search pipeline,
enabling users to discover more contextually relevant group posts. To
rigorously assess the impact of this blended approach, we introduce a novel
evaluation framework that leverages large language models (LLMs) to perform
offline relevance assessments, providing scalable and consistent quality
benchmarks. Our results demonstrate that the blended retrieval system
significantly enhances user engagement and search quality, as validated by both
online metrics and LLM-based evaluation. This work offers practical insights
for deploying and evaluating advanced retrieval systems in large-scale,
real-world social platforms.
Authors' comments: 5 Pages, work done as Yongye Su's internship project at Meta
Hanbin Ko, Gihun Cho, Inhyeok Baek, Donguk Kim, Joonbeom Koo, Changi Kim, Dongheon Lee, Chang Min Park
Vision-language pretraining has advanced image-text alignment, yet progress in radiology remains constrained by the heterogeneity of clinical reports, including abbreviations, impression-only notes, and stylistic variability. Unlike general-domain settings where more data often leads to better performance, naively scaling to large collections of noisy reports can plateau or even degrade model learning. We ask whether large language model (LLM) encoders can provide robust clinical representations that transfer across diverse styles and better guide image-text alignment. We introduce LLM2VEC4CXR, a domain-adapted LLM encoder for chest X-ray reports, and LLM2CLIP4CXR, a dual-tower framework that couples this encoder with a vision backbone. LLM2VEC4CXR improves clinical text understanding over BERT-based baselines, handles abbreviations and style variation, and achieves strong clinical alignment on report-level metrics. LLM2CLIP4CXR leverages these embeddings to boost retrieval accuracy and clinically oriented scores, with stronger cross-dataset generalization than prior medical CLIP variants. Trained on 1.6M CXR studies from public and private sources with heterogeneous and noisy reports, our models demonstrate that robustness -- not scale alone -- is the key to effective multimodal learning. We release models to support further research in medical image-text representation learning.
Authors' comments: 24 pages, 2 figures, under review
Zechao Liu, Zheng Zhou, Xiangkun Chen, Tao Liang, Dapeng Lang
Deep hashing models have been widely adopted to tackle the challenges of large-scale image retrieval. However, these approaches face serious security risks due to their vulnerability to adversarial examples. Despite the increasing exploration of targeted attacks on deep hashing models, existing approaches still suffer from a lack of multimodal guidance, reliance on labeling information and dependence on pixel-level operations for attacks. To address these limitations, we proposed DiffHash, a novel diffusion-based targeted attack for deep hashing. Unlike traditional pixel-based attacks that directly modify specific pixels and lack multimodal guidance, our approach focuses on optimizing the latent representations of images, guided by text information generated by a Large Language Model (LLM) for the target image. Furthermore, we designed a multi-space hash alignment network to align the high-dimension image space and text space to the low-dimension binary hash space. During reconstruction, we also incorporated text-guided attention mechanisms to refine adversarial examples, ensuring them aligned with the target semantics while maintaining visual plausibility. Extensive experiments have demonstrated that our method outperforms state-of-the-art (SOTA) targeted attack methods, achieving better black-box transferability and offering more excellent stability across datasets.
Zihan Wang, Zihan Liang, Zhou Shao, Yufei Ma, Huangyu Dai, Ben Chen, Lingtao Mao, Chenyi Lei et al.
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to
address key limitations of Large Language Models (LLMs), such as hallucination,
outdated knowledge, and lacking reference. However, current RAG frameworks
often struggle with identifying whether retrieved documents meaningfully
contribute to answer generation. This shortcoming makes it difficult to filter
out irrelevant or even misleading content, which notably impacts the final
performance. In this paper, we propose Document Information Gain (DIG), a novel
metric designed to quantify the contribution of retrieved documents to correct
answer generation. DIG measures a document's value by computing the difference
of LLM's generation confidence with and without the document augmented.
Further, we introduce InfoGain-RAG, a framework that leverages DIG scores to
train a specialized reranker, which prioritizes each retrieved document from
exact distinguishing and accurate sorting perspectives. This approach can
effectively filter out irrelevant documents and select the most valuable ones
for better answer generation. Extensive experiments across various models and
benchmarks demonstrate that InfoGain-RAG can significantly outperform existing
approaches, on both single and multiple retrievers paradigm. Specifically on
NaturalQA, it achieves the improvements of 17.9%, 4.5%, 12.5% in exact match
accuracy against naive RAG, self-reflective RAG and modern ranking-based RAG
respectively, and even an average of 15.3% increment on advanced proprietary
model GPT-4o across all datasets. These results demonstrate the feasibility of
InfoGain-RAG as it can offer a reliable solution for RAG in multiple
applications.
Authors' comments: EMNLP'25 Oral Presentation. Contact: benchen4395@gmail.com
Anu Pradhan, Alexandra Ortan, Apurv Verma, Madhavan Seshadri
The evaluation bottleneck in recommendation systems has become particularly
acute with the rise of Generative AI, where traditional metrics fall short of
capturing nuanced quality dimensions that matter in specialized domains like
legal research. Can we trust Large Language Models to serve as reliable judges
of their own kind? This paper investigates LLM-as-a-Judge as a principled
approach to evaluating Retrieval-Augmented Generation systems in legal
contexts, where the stakes of recommendation quality are exceptionally high.
We tackle two fundamental questions that determine practical viability: which
inter-rater reliability metrics best capture the alignment between LLM and
human assessments, and how do we conduct statistically sound comparisons
between competing systems? Through systematic experimentation, we discover that
traditional agreement metrics like Krippendorff's alpha can be misleading in
the skewed distributions typical of AI system evaluations. Instead, Gwet's AC2
and rank correlation coefficients emerge as more robust indicators for judge
selection, while the Wilcoxon Signed-Rank Test with Benjamini-Hochberg
corrections provides the statistical rigor needed for reliable system
comparisons.
Our findings suggest a path toward scalable, cost-effective evaluation that
maintains the precision demanded by legal applications, transforming what was
once a human-intensive bottleneck into an automated, yet statistically
principled, evaluation framework.
Authors' comments: Accepted in EARL 25: The 2nd Workshop on Evaluating and Applying
Recommender Systems with Large Language Models at RecSys 2025
Veronika Oehl, Alexander Damm
Sun-induced fluorescence (SIF) as a close remote sensing based proxy for
photosynthesis is accepted as a useful measure to remotely monitor vegetation
health and gross primary productivity. In this work we present the new
retrieval method WAFER (WAvelet decomposition FluorEscence Retrieval) based on
wavelet decompositions of the measured spectra of reflected radiance as well as
a reference radiance not containing fluorescence. By comparing absolute
absorption line depths by means of the corresponding wavelet coefficients, a
relative reflectance is retrieved independently of the fluorescence, i.e.
without introducing a coupling between reflectance and fluorescence. The
fluorescence can then be derived as the remaining offset. This method can be
applied to arbitrary chosen wavelength windows in the whole spectral range,
such that all the spectral data available is exploited, including the
separation into several frequency (i.e. width of absorption lines) levels and
without the need of extensive training datasets. At the same time, the
assumptions about the reflectance shape are minimal and no spectral shape
assumptions are imposed on the fluorescence, which not only avoids biases
arising from wrong or differing fluorescence models across different spatial
scales and retrieval methods but also allows for the exploration of this
spectral shape for different measurement setups. WAFER is tested on a synthetic
dataset as well as several diurnal datasets acquired with a field spectrometer
(FloX) over an agricultural site. We compare the WAFER method to two
established retrieval methods, namely the improved Fraunhofer line
discrimination (iFLD) method and spectral fitting method (SFM) and find a good
agreement with the added possibility of exploring the true spectral shape of
the offset signal and free choice of the retrieval window. (abbreviated)
Authors' comments: 20 pages, 13 figures. Published in Remote Sensing of Environment
(2023)
Piyushkumar Patel
E-Commerce customer support requires quick and accurate answers grounded in product data and past support cases. This paper develops a novel retrieval-augmented generation (RAG) framework that uses knowledge graphs (KGs) to improve the relevance of the answer and the factual grounding. We examine recent advances in knowledge-augmented RAG and chatbots based on large language models (LLM) in customer support, including Microsoft's GraphRAG and hybrid retrieval architectures. We then propose a new answer synthesis algorithm that combines structured subgraphs from a domain-specific KG with text documents retrieved from support archives, producing more coherent and grounded responses. We detail the architecture and knowledge flow of our system, provide comprehensive experimental evaluation, and justify its design in real-time support settings. Our implementation demonstrates 23\% improvement in factual accuracy and 89\% user satisfaction in e-Commerce QA scenarios.
Qingsong Wang, Tao Wu, Wang Lin, Yueying Feng, Gongsheng Yuan, Chang Yao, Jingyuan Chen
Large Language Models (LLMs) have demonstrated strong performance in open-ended generation tasks. However, they often struggle to adapt content to users with differing cognitive capacities, leading to a phenomenon we term cognitive misalignment. This issue arises in two forms: knowledge-level misalignment, where content is too complex or too simplistic relative to user understanding, and presentation-style misalignment, where the structure or tone hinders effective comprehension. To address these challenges, we propose the Cognitive-Level Alignment Framework (CLAF), a general-purpose generation framework that aligns both knowledge complexity and presentation style with user cognition. CLAF integrates a capability-aware retrieval module based on a hierarchical knowledge graph and a style optimization module guided by Bloom's taxonomy and preference learning. Additionally, a knowledge-controllable generation component ensures consistency and relevance throughout the output. To support training and evaluation, we construct SCALE, a cognitively annotated dataset containing responses at multiple comprehension levels per query. Empirical results show that CLAF enhances the adaptability and informativeness of LLM outputs across a range of user profiles, offering a robust solution to cognitive-level alignment in real-world applications.
Authors' comments: Accepted to Findings of EMNLP 2026
Zahra Borhani, Ali Ebrahimpour-Boroojeny, Francisco R. Ortega
Augmented Reality (AR) enables intuitive interaction with virtual annotations overlaid on the real world, supporting a wide range of applications such as remote assistance, education, and industrial training. However, as the number of heterogeneous annotations increases, their efficient retrieval remains an open challenge in 3D environments. This paper examines how interaction modalities and presentation designs affect user performance, workload, fatigue, and preference in AR annotation retrieval. In two user studies, we compare eye-gaze versus hand-ray hovering and evaluate four presentation methods: Opacity-based, Scale-based, Nothing-based, and Marker-based. Results show that eye-gaze was favored over hand-ray by users, despite leading to significantly higher unintentional activations. Among the presentation methods, Scale-based presentation reduces workload and task completion time while aligning with user preferences. Our findings offer empirical insights into the effectiveness of different annotation presentation methods, leading to design recommendations for building more efficient and user-friendly AR annotation review systems.
Waikit Xiu, Qiang Lu, Xiying Li, Chen Hu, Shengbo Sun
As intelligent transportation systems advance, traffic video understanding plays an increasingly pivotal role in comprehensive scene perception and causal analysis. Yet, existing approaches face notable challenges in accurately modeling spatiotemporal causality and integrating domain-specific knowledge, limiting their effectiveness in complex scenarios. To address these limitations, we propose Traffic-MLLM, a multimodal large language model tailored for fine-grained traffic analysis. Built on the Qwen2.5-VL backbone, our model leverages high-quality traffic-specific multimodal datasets and uses Low-Rank Adaptation (LoRA) for lightweight fine-tuning, significantly enhancing its capacity to model continuous spatiotemporal features in video sequences. Furthermore, we introduce an innovative knowledge prompting module fusing Chain-of-Thought (CoT) reasoning with Retrieval-Augmented Generation (RAG), enabling precise injection of detailed traffic regulations and domain knowledge into the inference process. This design markedly boosts the model's logical reasoning and knowledge adaptation capabilities. Experimental results on TrafficQA and DriveQA benchmarks show Traffic-MLLM achieves state-of-the-art performance, validating its superior ability to process multimodal traffic data. It also exhibits remarkable zero-shot reasoning and cross-scenario generalization capabilities.
Thomas Y. Chen
We establish the first information-theoretic limits for multimodal retrieval.
Casting ranking as lossy source coding, we derive a single-letter
rate-distortion function $R(D)$ for reciprocal-rank distortion and prove a
converse bound that splits into a modality-balanced term plus a skew penalty
$\kappa\,\Delta H$ capturing entropy imbalance and cross-modal redundancy. We
then construct an explicit entropy-weighted stochastic quantizer with an
adaptive, per-modality temperature decoder; a Blahut-Arimoto argument shows
this scheme achieves distortion within $O(n^{-1})$ of $R(D)$ using $n$ training
triples. A VC-type analysis yields the first finite-sample excess-risk bound
whose complexity scales sub-linearly in both the number of modalities and the
entropy gap. Experiments on controlled Gaussian mixtures and Flickr30k confirm
that our adaptive codes sit within two percentage points of the theoretical
frontier, while fixed-temperature and naive CLIP baselines lag significantly.
Taken together, our results give a principled answer to "how many bits per
query are necessary" for high-quality multimodal retrieval and provide design
guidance for entropy-aware contrastive objectives, continual-learning
retrievers, and retrieval-augmented generators.
Authors' comments: ICCV MRR 2025
Xinyu Zhang, Pei Zhang, Shuang Luo, Jialong Tang, Yu Wan, Baosong Yang, Fei Huang
Cultural competence, defined as the ability to understand and adapt to
multicultural contexts, is increasingly vital for large language models (LLMs)
in global environments. While several cultural benchmarks exist to assess LLMs'
cultural competence, current evaluations suffer from fragmented taxonomies,
domain specificity, and heavy reliance on manual data annotation. To address
these limitations, we introduce CultureSynth, a novel framework comprising (1)
a comprehensive hierarchical multilingual cultural taxonomy covering 12 primary
and 130 secondary topics, and (2) a Retrieval-Augmented Generation (RAG)-based
methodology leveraging factual knowledge to synthesize culturally relevant
question-answer pairs. The CultureSynth-7 synthetic benchmark contains 19,360
entries and 4,149 manually verified entries across 7 languages. Evaluation of
14 prevalent LLMs of different sizes reveals clear performance stratification
led by ChatGPT-4o-Latest and Qwen2.5-72B-Instruct. The results demonstrate that
a 3B-parameter threshold is necessary for achieving basic cultural competence,
models display varying architectural biases in knowledge processing, and
significant geographic disparities exist across models. We believe that
CultureSynth offers a scalable framework for developing culturally aware AI
systems while reducing reliance on manual annotation\footnote{Benchmark is
available at https://github.com/Eyr3/CultureSynth.}.
Authors' comments: Accepted as a Findings paper at EMNLP 2025
Xiao Liang, Zhen Qin, Zhihui Zhu, Shuang Li
This paper presents a geometric analysis of the simultaneous blind
deconvolution and phase retrieval (BDPR) problem via a structured low-rank
tensor recovery framework. Due to the highly complicated structure of the
associated sensing tensor, directly characterizing its optimization landscape
is intractable. To address this, we introduce a tensor sensing problem as a
tractable surrogate that preserves the essential structural features of the
target low-rank tensor while enabling rigorous theoretical analysis. As a first
step toward understanding this surrogate model, we study the corresponding
population risk, which captures key aspects of the underlying low-rank tensor
structure. We characterize the global landscape of the population risk on the
unit sphere and show that Riemannian gradient descent (RGD) converges linearly
under mild conditions. We then extend the analysis to the tensor sensing
problem, establishing local geometric properties, proving convergence
guarantees for RGD, and quantifying robustness under measurement noise. Our
theoretical results are further supported by extensive numerical experiments.
These findings offer foundational insights into the optimization landscape of
the structured low-rank tensor recovery problem, which equivalently
characterizes the original BDPR problem, thereby providing principled guidance
for solving the original BDPR problem.
Authors' comments: 17 pages, 18 figures
Meng Huang, Jinming Wen, Ran Zhang
The PhaseLift algorithm is an effective convex method for solving the phase retrieval problem from Fourier measurements with coded diffraction patterns (CDP). While exact reconstruction guarantees are well-established in the noiseless case, the stability of recovery under noise remains less well understood. In particular, when the measurements are corrupted by an additive noise vector $\mathbf{w} \in \mathbb{R}^m$, existing recovery bounds scale on the order of $\|\mathbf{w}\|_2$, which is conjectured to be suboptimal. More recently, Soltanolkotabi conjectured that the optimal PhaseLift recovery bound should scale with the average noise magnitude, that is, on the order of $\|\mathbf{w}\|_2/\sqrt m$. However, establishing this theoretically is considerably more challenging and has remained an open problem. In this paper, we focus on this conjecture and provide a nearly optimal recovery bound for it. We prove that under adversarial noise, the recovery error of PhaseLift is bounded by $O(\log n \cdot \|\mathbf{w}\|_2/\sqrt m)$, and further show that there exists a noise vector for which the error lower bound exceeds $O\bigl(\frac{1}{\sqrt{\log n}} \cdot \frac{\|\mathbf{w}\|_2}{\sqrt m}\bigr)$. Here, $n$ is the dimension of the signals we aim to recover. Moreover, for mean-zero sub-Gaussian noise vector $\mathbf{w} \in \mathbb R^m$ with sub-Gaussian norm $\sigma$, we establish a bound of order $O\bigl(\sigma \sqrt{\frac{n \log^4 n}{m}}\bigr)$, and also provide a corresponding minimax lower bound. Our results affirm Soltanolkotabi's conjecture up to logarithmic factors, providing a new insight into the stability of PhaseLift under noisy CDP measurements.
Pengcheng Jiang, Siru Ouyang, Yizhu Jiao, Ming Zhong, Runchu Tian, Jiawei Han
Large Language Models (LLMs) have revolutionized natural language processing with their remarkable capabilities in text generation and reasoning. However, these models face critical challenges when deployed in real-world applications, including hallucination generation, outdated knowledge, and limited domain expertise. Retrieval And Structuring (RAS) Augmented Generation addresses these limitations by integrating dynamic information retrieval with structured knowledge representations. This survey (1) examines retrieval mechanisms including sparse, dense, and hybrid approaches for accessing external knowledge; (2) explore text structuring techniques such as taxonomy construction, hierarchical classification, and information extraction that transform unstructured text into organized representations; and (3) investigate how these structured representations integrate with LLMs through prompt-based methods, reasoning frameworks, and knowledge embedding techniques. It also identifies technical challenges in retrieval efficiency, structure quality, and knowledge integration, while highlighting research opportunities in multimodal retrieval, cross-lingual structures, and interactive systems. This comprehensive overview provides researchers and practitioners with insights into RAS methods, applications, and future directions.
Authors' comments: KDD'25 survey track
Wenhao Yang, Jianguo Wei, Wenhuan Lu, Xinyue Song, Xianghu Yue
Image retrieval using spoken language cues has emerged as a promising
direction in multimodal perception, yet leveraging speech in multi-speaker
scenarios remains challenging. We propose a novel Target Speaker Speech-Image
Retrieval task and a framework that learns the relationship between images and
multi-speaker speech signals in the presence of a target speaker. Our method
integrates pre-trained self-supervised audio encoders with vision models via
target speaker-aware contrastive learning, conditioned on a Target Speaker
Extraction and Retrieval module. This enables the system to extract spoken
commands from the target speaker and align them with corresponding images.
Experiments on SpokenCOCO2Mix and SpokenCOCO3Mix show that TSRE significantly
outperforms existing methods, achieving 36.3% and 29.9% Recall@1 in 2 and 3
speaker scenarios, respectively - substantial improvements over single speaker
baselines and state-of-the-art models. Our approach demonstrates potential for
real-world deployment in assistive robotics and multimodal interaction systems.
Authors' comments: 5 pages, 2 figures